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Abstract —The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is 
because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. 
Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the 
quest for Al is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews 
recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, 
manifold learning, and deep networks. This motivates longer term unanswered questions about the appropriate objectives for learning 
good representations, for computing representations (i.e., inference), and the geometrical connections between representation 
learning, density estimation, and manifold learning. 

Index Terms —Deep learning, representation learning, feature learning, unsupervised learning, Boltzmann machine, autoencoder, 
neural nets 
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1 Introduction 

he performance of machine learning methods is heavily 
dependent on the choice of data representation (or 
features) on which they are applied. For that reason, much 
of the actual effort in deploying machine learning algo¬ 
rithms goes into the design of preprocessing pipelines and 
data transformations that result in a representation of the 
data that can support effective machine learning. Such 
feature engineering is important but labor intensive and 
highlights the weakness of current learning algorithms: 
Their inability to extract and organize the discriminative 
information from the data. Feature engineering is a way to 
take advantage of human ingenuity and prior knowledge to 
compensate for that weakness. To expand the scope and 
ease of applicability of machine learning, it would be highly 
desirable to make learning algorithms less dependent on 
feature engineering so that novel applications could be 
constructed faster, and more importantly, to make progress 
toward artificial intelligence (AI). An AI must fundamen¬ 
tally understand the world around us, and we argue that this 
can only be achieved if it can learn to identify and 
disentangle the underlying explanatory factors hidden in 
the observed milieu of low-level sensory data. 

This paper is about representation learning, i.e., learning 
representations of the data that make it easier to extract 
useful information when building classifiers or other 
predictors. In the case of probabilistic models, a good 
representation is often one that captures the posterior 


• The authors are with the Department of Computer Science and Operations 
Research, Universite de Montreal, PO Box 6128, Succ. Centre-Ville, 
Montreal, Quebec H3C 3/7, Canada. 

Manuscript received 9 Apr. 2012; revised 17 Oct. 2012; accepted 24, Feb. 2013; 
published online 28 Feb. 2013. 

Recommended for acceptance by S. Bengio, L. Deng, H. Larochelle, H. Lee, and 
R. Salakhutdinov. 

For information on obtaining reprints of this article, please send e-mail to: 
tpami@computer.org, and reference IEEECS Log Number 
TPAMISI-2012-04-0260. 

Digital Object Identifier no. 10.1109fTPAMI.2013.50. 

0162-8828/13/$31.00 © 2013 IEEE 


distribution of the underlying explanatory factors for the 
observed input. A good representation is also one that is 
useful as input to a supervised predictor. Among the 
various ways of learning representations, this paper focuses 
on deep learning methods: those that are formed by the 
composition of multiple nonlinear transformations with 
the goal of yielding more abstract—and ultimately more 
useful—representations. Ffere, we survey this rapidly 
developing area with special emphasis on recent progress. 
We consider some of the fundamental questions that have 
been driving research in this area. Specifically, what makes 
one representation better than another? Given an example, 
how should we compute its representation, i.e., perform 
feature extraction? Also, what are appropriate objectives for 
learning good representations? 

2 Why Should We Care about Learning 
Representations? 

Representation learning has become a field in itself in the 
machine learning community, with regular workshops at 
the leading conferences such as NIPS and ICML, and a new 
conference dedicated to it, ICLR, 1 sometimes under the 
header of Deep Learning or Feature Learning. Although depth 
is an important part of the story, many other priors are 
interesting and can be conveniently captured when the 
problem is cast as one of learning a representation, as 
discussed in the next section. The rapid increase in scientific 
activity on representation learning has been accompanied 
and nourished by a remarkable string of empirical successes 
both in academia and in industry. Below, we briefly 
highlight some of these high points. 

2.1 Speech Recognition and Signal Processing 

Speech was one of the early applications of neural 
networks, in particular convolutional (or time-delay) neural 

1. International Conference on Learning Representations. 
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networks. 2 The recent revival of interest in neural networks, 
deep learning, and representation learning has had a strong 
impact in the area of speech recognition, with breakthrough 
results [54], [56], [183], [148], [55], [86] obtained by several 
academics as well as researchers at industrial labs bringing 
these algorithms to a larger scale and into products. For 
example, Microsoft released in 2012 a new version of their 
Microsoft Audio Video Indexing Service speech system 
based on deep learning [183]. These authors managed to 
reduce the word error rate on four major benchmarks by 
about 30 percent (e.g., from 27.4 to 18.5 percent on RT03S) 
compared to state-of-the-art models based on Gaussian 
mixtures for the acoustic modeling and trained on the same 
amount of data (309 hours of speech). The relative 
improvement in error rate obtained by Dahl et al. [55] on 
a smaller large-vocabulary speech recognition benchmark 
(Bing mobile business search dataset, with 40 hours of 
speech) is between 16 and 23 percent. 

Representation-learning algorithms have also been ap¬ 
plied to music, substantially beating the state of the art in 
polyphonic transcription [34], with relative error improve¬ 
ment between 5 and 30 percent on a standard benchmark of 
four datasets. Deep learning also helped to win MIREX 
(music information retrieval) competitions, for example, in 
2011 on audio tagging [81]. 

2.2 Object Recognition 

The beginnings of deep learning in 2006 focused on the 
MNIST digit image classification problem [94], [23], break¬ 
ing the supremacy of SVMs (1.4 percent error) on this 
dataset. 3 The latest records are still held by deep networks: 
Ciresan et al. [46] currently claim the title of state of the art 
for the unconstrained version of the task (e.g., using a 
convolutional architecture), with 0.27 percent error, and 
Rifai et al. [169] is state of the art for the knowledge-free 
version of MNIST, with 0.81 percent error. 

In the last few years, deep learning has moved from 
digits to object recognition in natural images, and the latest 
breakthrough has been achieved on the ImageNet dataset, 4 
bringing down the state-of-the-art error rate from 26.1 to 

15.3 percent [117]. 

2.3 Natural Language Processing (NLP) 

Besides speech recognition, there are many other NLP 
applications of representation learning. Distributed represen¬ 
tations for symbolic data were introduced by Hinton [87], 
and first developed in the context of statistical language 
modeling by Bengio et al. [19] in the so-called neural net 
language models [10]. They are all based on learning a 
distributed representation for each word, called a word 
embedding. Adding a convolutional architecture, Collobert 
et al. [51] developed the SENNA system 5 that shares 
representations across the tasks of language modeling, 
part-of-speech tagging, chunking, named entity recognition, 
semantic role labeling, and syntactic parsing. SENNA 

2. See [9] for a review of early work in this area. 

3. For the knowledge-free version of the task where no image-specific 
prior is used, such as image deformations or convolutions. 

4. The 1,000-class ImageNet benchmark, whose results are detailed here: 
http://www.image-net.org/challenges/LSVRC/2012/results.html. 

5. Downloadable from http://ml.nec-labs.com/senna/. 
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Fig. 1. Illustration of representation-learning discovering explanatory 
factors (middle hidden layer, in red), some explaining the input (semi- 
supervised setting), and some explaining target for each task. Because 
these subsets overlap, sharing of statistical strength helps generalization. 

approaches or surpasses the state of the art on these tasks, 
but is simpler and much faster than traditional predictors. 
Learning word embeddings can be combined with learning 
image representations in a way that allows associating text 
and images. This approach has been used successfully to 
build Google's image search, exploiting huge quantities of 
data to map images and queries in the same space [218], 
and it has recently been extended to deeper multimodal 
representations [194]. 

The neural net language model was also improved by 
adding recurrence to the hidden layers [146], allowing it to 
beat the state of the art (smoothed n-gram models) not 
only in terms of perplexity (exponential of the average 
negative log likelihood of predicting the right next word, 
going down from 140 to 102), but also in terms of word 
error rate in speech recognition (since the language model 
is an important component of a speech recognition 
system), decreasing it from 17.2 percent (KN5 baseline) 
or 16.9 percent (discriminative language model) to 

14.4 percent on the Wall Street Journal benchmark task. 
Similar models have been applied in statistical machine 
translation [182], [121], improving perplexity and BLEU 
scores. Recursive autoencoders (which generalize recurrent 
networks) have also been used to beat the state of the art 
in full sentence paraphrase detection [192], almost dou¬ 
bling the FI score for paraphrase detection. Representation 
learning can also be used to perform word sense 
disambiguation [33], bringing up the accuracy from 67.8 
to 70.2 percent on the subset of Senseval-3 where the 
system could be applied (with subject-verb-object sen¬ 
tences). Finally, it has also been successfully used to 
surpass the state of the art in sentiment analysis [70], [193]. 

2.4 Multitask and Transfer Learning, Domain 
Adaptation 

Transfer learning is the ability of a learning algorithm to 
exploit commonalities between different learning tasks to 
share statistical strength and transfer knowledge across tasks. 
As discussed below, we hypothesize that representation 
learning algorithms have an advantage for such tasks 
because they learn representations that capture underlying 
factors, a subset of which may be relevant for each 
particular task, as illustrated in Fig. 1. This hypothesis 
seems confirmed by a number of empirical results showing 
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the strengths of representation learning algorithms in 
transfer learning scenarios. 

Most impressive are the two transfer learning challenges 
held in 2011 and won by representation learning algo¬ 
rithms. First, the transfer learning challenge, presented at 
an ICML 2011 workshop of the same name, was won using 
unsupervised layerwise pretraining [12], [145]. A second 
transfer learning challenge was held the same year and 
won by Goodfellow et al. [72]. Results were presented at 
NIPS 2011's Challenges in Learning Hierarchical Models 
Workshop. In the related domain adaptation setup, the target 
remains the same but the input distribution changes [70], 
[43]. In the multitask learning setup, representation learning 
has also been found to be advantageous [117], [51] because 
of shared factors across tasks. 

3 What Makes a Representation Good? 

3.1 Priors for Representation Learning in Al 

In [16], one of us introduced the notion of Al tasks, which 
are challenging for current machine learning algorithms 
and involve complex but highly structured dependencies. 
One reason why explicitly dealing with representations is 
interesting is because they can be convenient to express 
many general priors about the world around us, i.e., priors 
that are not task specific but would be likely to be useful for 
a learning machine to solve Al tasks. Examples of such 
general-purpose priors are the following: 

• Smoothness: Assumes that the function to be learned 
/ is s.t. x k, y generally implies f(x) « f(y). This 
most basic prior is present in most machine learning, 
but is insufficient to get around the curse of 
dimensionality; see Section 3.2. 

• Multiple explanatory factors: The data generating 
distribution is generated by different underlying 
factors, and for the most part, what one learns about 
one factor generalizes in many configurations of the 
other factors. The objective to recover or at least 
disentangle these underlying factors of variation is 
discussed in Section 3.5. This assumption is behind 
the idea of distributed representations, discussed in 
Section 3.3. 

• A hierarchical organization of explanatory factors: The 
concepts that are useful for describing the world 
around us can be defined in terms of other concepts, 
in a hierarchy, with more abstract concepts higher in 
the hierarchy, defined in terms of less abstract ones. 
This assumption is exploited with deep representa¬ 
tions, elaborated in Section 3.4. 

• Semi-supervised learning: With inputs X and target Y 
to predict, a subset of the factors explaining X's 
distribution explain much of Y, given X. Hence, 
representations that are useful for P(X) tend to be 
useful when learning P(Y \ X), allowing sharing of 
statistical strength between the unsupervised and 
supervised learning tasks, see Section 4. 

• Shared factors across tasks: With many ys of interest or 
many learning tasks in general, tasks (e.g., the 
corresponding P(Y \ X, task)) are explained by 
factors that are shared with other tasks, allowing 
sharing of statistical strengths across tasks, as 


discussed in the previous section (multitask and 
transfer learning, domain adaptation). 

• Manifolds: Probability mass concentrates near re¬ 
gions that have a much smaller dimensionality than 
the original space where the data live. This is 
explicitly exploited in some of the autoencoder 
algorithms and other manifold-inspired algorithms 
described, respectively, in Sections 7.2 and 8. 

• Natural clustering: Different values of categorical 
variables such as object classes are associated with 
separate manifolds. More precisely, the local varia¬ 
tions on the manifold tend to preserve the value of a 
category, and a linear interpolation between exam¬ 
ples of different classes in general involves going 
through a low-density region, i.e., P(X \ Y = i) for 
different i tend to be well separated and not overlap 
much. For example, this is exploited in the manifold 
tangent classifier (MTC) discussed in Section 8.3. 
This hypothesis is consistent with the idea that 
humans have named categories and classes because 
of such statistical structure (discovered by their 
brain and propagated by their culture), and machine 
learning tasks often involve predicting such catego¬ 
rical variables. 

• Temporal and spatial coherence: Consecutive (from a 
sequence) or spatially nearby observations tend to be 
associated with the same value of relevant catego¬ 
rical concepts or result in a small move on the 
surface of the high-density manifold. More gener¬ 
ally, different factors change at different temporal 
and spatial scales, and many categorical concepts of 
interest change slowly. When attempting to capture 
such categorical variables, this prior can be enforced 
by making the associated representations slowly 
changing, i.e., penalizing changes in values over 
time or space. This prior was introduced in [6] and is 
discussed in Section 11.3. 

• Sparsity: For any given observation x, only a small 
fraction of the possible factors are relevant. In terms 
of representation, this could be represented by 
features that are often zero (as initially proposed 
by Olshausen and Field [155]), or by the fact that 
most of the extracted features are insensitive to small 
variations of x. This can be achieved with certain 
forms of priors on latent variables (peaked at 0), or 
by using a nonlinearity whose value is often flat at 0 
(i.e., 0 and with a 0 derivative), or simply by 
penalizing the magnitude of the Jacobian matrix (of 
derivatives) of the function mapping input to 
representation. This is discussed in Sections 6.1.1 
and 7.2. 

• Simplicity of factor dependencies: In good high-level 
representations, the factors are related to each other 
through simple, typically linear dependencies. This 
can be seen in many laws of physics and is assumed 
when plugging a linear predictor on top of a learned 
representation. 

We can view many of the above priors as ways to help the 
learner discover and disentangle some of the underlying (and 
a priori unknown) factors of variation that the data may 
reveal. This idea is pursued further in Sections 3.5 and 11.4. 
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3.2 Smoothness and the Curse of Dimensionality 

For AI tasks such as vision and NLP, it seems hopeless to 
rely only on simple parametric models (such as linear 
models) because they cannot capture enough of the 
complexity of interest unless provided with the appropriate 
feature space. Conversely, machine learning researchers 
have sought flexibility in local 8 nonparametric learners such as 
kernel machines with a fixed generic local-response kernel 
(such as the Gaussian kernel). Unfortunately, as argued at 
length by Bengio and Monperrus [17], Bengio et al. [21], 
Bengio and LeCun [16], Bengio [11], and Bengio et al. [25], 
most of these algorithms only exploit the principle of local 
generalization, i.e., the assumption that the target function (to 
be learned) is smooth enough, so they rely on examples to 
explicitly map out the wrinkles of the target function. General¬ 
ization is mostly achieved by a form of local interpolation 
between neighboring training examples. Although smooth¬ 
ness can be a useful assumption, it is insufficient to deal 
with the curse of dimensionality because the number of such 
wrinkles (ups and downs of the target function) may grow 
exponentially with the number of relevant interacting 
factors when the data are represented in raw input space. 
We advocate learning algorithms that are flexible and 
nonparametric, 6 7 but do not rely exclusively on the smooth¬ 
ness assumption. Instead, we propose to incorporate generic 
priors such as those enumerated above into representation¬ 
learning algorithms. Smoothness-based learners (such as 
kernel machines) and linear models can still be useful on top 
of such learned representations. In fact, the combination of 
learning a representation and kernel machine is equivalent 
to learning the kernel, i.e., the feature space. Kernel machines 
are useful, but they depend on a prior definition of a suitable 
similarity metric or a feature space in which naive similarity 
metrics suffice. We would like to use the data, along with 
very generic priors, to discover those features or, equiva¬ 
lently, a similarity function. 

3.3 Distributed Representations 

Good representations are expressive, meaning that a reason¬ 
ably sized learned representation can capture a huge 
number of possible input configurations. A simple counting 
argument helps us to assess the expressiveness of a model 
producing a representation: How many parameters does it 
require compared to the number of input regions (or 
configurations) it can distinguish? Learners of one-hot 
representations, such as traditional clustering algorithms, 
Gaussian mixtures, nearest-neighbor algorithms, decision 
trees, or Gaussian SVMs, all require O(N) parameters (and/ 
or O(N) examples) to distinguish O(N) input regions. One 
could naively believe that one cannot do better. However, 
restricted Boltzmann machines (RBMs), sparse coding, 
autoencoders, or multilayer neural networks can all 
represent up to 0(2 k ) input regions using only O(N) 
parameters (with k the number of nonzero elements in a 

6. Local in the sense that the value of the learned function at x depends 
mostly on training examples x^s close to x. 

7. We understand nonparametric as including all learning algorithms 
whose capacity can be increased appropriately as the amount of data 
and its complexity demands it, for example, including mixture models 
and neural networks where the number of parameters is a data- 
selected hyperparameter. 


sparse representation, and k = N in nonsparse RBMs and 
other dense representations). These are all distributed 8 or 
sparse 9 representations. The generalization of clustering to 
distributed representations is multiclustering, where either 
several clusterings take place in parallel or the same 
clustering is applied on different parts of the input, such 
as in the very popular hierarchical feature extraction for 
object recognition based on a histogram of cluster categories 
detected in different patches of an image [120], [48]. The 
exponential gain from distributed or sparse representations 
is discussed further in [11, Section 3.2, Fig. 3.2]. It comes 
about because each parameter (e.g., the parameters of one of 
the units in a sparse code or one of the units in an RBM) can 
be reused in many examples that are not simply near 
neighbors of each other, whereas with local generalization, 
different regions in input space are basically associated with 
their own private set of parameters, for example, as in 
decision trees, nearest neighbors, Gaussian SVMs, and so 
on. In a distributed representation, an exponentially large 
number of possible subsets of features or hidden units can be 
activated in response to a given input. In a single-layer 
model, each feature is typically associated with a preferred 
input direction, corresponding to a hyperplane in input 
space, and the code or representation associated with that 
input is precisely the pattern of activation (which features 
respond to the input, and how much). This is in contrast 
with a nondistributed representation such as the one 
learned by most clustering algorithms, for example, 
k-means, in which the representation of a given input 
vector is a one-hot code identifying which one of a small 
number of cluster centroids best represents the input. 10 

3.4 Depth and Abstraction 

Depth is a key aspect to the representation learning strategies 
we consider in this paper. As we will discuss, deep 
architectures are often challenging to train effectively, and 
this has been the subject of much recent research and 
progress. However, despite these challenges, they carry two 
significant advantages that motivate our long-term interest in 
discovering successful training strategies for deep architec¬ 
tures. These advantages are: 1) deep architectures promote 
the reuse of features, and 2) deep architectures can potentially 
lead to progressively more abstract features at higher layers of 
representations (more removed from the data). 

Feature reuse. The notion of reuse, which explains the 
power of distributed representations, is also at the heart of 
the theoretical advantages behind deep learning, i.e., 

8. Distributed representations, where k out of N representation 
elements or feature values can be independently varied; for example, 
they are not mutually exclusive. Each concept is represented by having k 
features being turned on or active, while each feature is involved in 
representing many concepts. 

9. Sparse representations: distributed representations where only a few 
of the elements can be varied at a time, i.e., k < N. 

10. As discussed in [11], things are only slightly better when allowing 
continuous-valued membership values, for example, in ordinary mixture 
models (with separate parameters for each mixture component), but the 
difference in representational power is still exponential [149]. The situation 
may also seem better with a decision tree, where each given input is 
associated with a one-hot code over the tree leaves, which deterministically 
selects associated ancestors (the path from root to node). Unfortunately, the 
number of different regions represented (equal to the number of leaves of 
the tree) still only grows linearly with the number of parameters used to 
specify it [15]. 
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constructing multiple levels of representation or learning a 
hierarchy of features. The depth of a circuit is the length of 
the longest path from an input node of the circuit to an 
output node of the circuit. The crucial property of a deep 
circuit is that its number of paths, i.e., ways to reuse different 
parts, can grow exponentially with its depth. Formally, one 
can change the depth of a given circuit by changing the 
definition of what each node can compute, but only by a 
constant factor. The typical computations we allow in each 
node include: weighted sum, product, artificial neuron 
model (such as a monotone nonlinearity on top of an 
affine transformation), computation of a kernel, or logic 
gates. Theoretical results clearly show families of functions 
where a deep representation can be exponentially more 
efficient than one that is insufficiently deep [82], [83], [21], 
[16], [15]. If the same family of functions can be 
represented with fewer parameters (or more precisely 
with a smaller VC dimension), learning theory would 
suggest that it can be learned with fewer examples, 
yielding improvements in both computational efficiency 
(less nodes to visit) and statistical efficiency (fewer 
parameters to learn, and reuse of these parameters over 
many different kinds of inputs). 

Abstraction and invariance. Deep architectures can lead to 
abstract representations because more abstract concepts can 
often be constructed in terms of less abstract ones. In some 
cases, such as in the convolutional neural network [133], we 
explicitly build this abstraction in via a pooling mechanism 
(see Section 11.2). More abstract concepts are generally 
invariant to most local changes of the input. That makes the 
representations that capture these concepts generally highly 
nonlinear functions of the raw input. This is obviously true 
of categorical concepts, where more abstract representations 
detect categories that cover more varied phenomena (e.g., 
larger manifolds with more wrinkles), and thus they 
potentially have greater predictive power. Abstraction can 
also appear in high-level continuous-valued attributes that 
are only sensitive to some very specific types of changes in 
the input. Learning these sorts of invariant features has 
been a long-standing goal in pattern recognition. 

3.5 Disentangling Factors of Variation 

Beyond being distributed and invariant, we would like our 
representations to disentangle the factors of variation. Different 
explanatory factors of the data tend to change indepen¬ 
dently of each other in the input distribution, and only a 
few at a time tend to change when one considers a sequence 
of consecutive real-world inputs. 

Complex data arise from the rich interaction of many 
sources. These factors interact in a complex web that can 
complicate Al-related tasks such as object classification. For 
example, an image is composed of the interaction between 
one or more light sources, the object shapes, and the 
material properties of the various surfaces present in the 
image. Shadows from objects in the scene can fall on each 
other in complex patterns, creating the illusion of object 
boundaries where there are none, and can dramatically 
effect the perceived object shape. Flow can we cope with 
these complex interactions? Flow can we disentangle the 
objects and their shadows? Ultimately, we believe the 
approach we adopt for overcoming these challenges must 
leverage the data itself, using vast quantities of unlabeled 


examples, to learn representations that separate the various 
explanatory sources. Doing so should give rise to a 
representation significantly more robust to the complex 
and richly structured variations extant in natural data 
sources for Al-related tasks. 

It is important to distinguish between the related but 
distinct goals of learning invariant features and learning to 
disentangle explanatory factors. The central difference is 
the preservation of information. Invariant features, by 
definition, have reduced sensitivity in the direction of 
invariance. This is the goal of building features that are 
insensitive to variation in the data that are uninformative 
to the task at hand. Unfortunately, it is often difficult to 
determine a priori which set of features and variations will 
ultimately be relevant to the task at hand. Further, as is 
often the case in the context of deep learning methods, the 
feature set being trained may be destined to be used in 
multiple tasks that may have distinct subsets of relevant 
features. Considerations such as these lead us to the 
conclusion that the most robust approach to feature 
learning is to disentangle as many factors as possible, discarding 
as little information about the data as is practical. If some form 
of dimensionality reduction is desirable, then we hypothe¬ 
size that the local directions of variation least represented 
in the training data should be first to be pruned out (as in 
principal components analysis (PCA), for example, which 
does it globally instead of around each example). 

3.6 Good Criteria for Learning Representations? 

One of the challenges of representation learning that 
distinguishes it from other machine learning tasks such 
as classification is the difficulty in establishing a clear 
objective or target for training. In the case of classification, 
the objective is (at least conceptually) obvious; we want to 
minimize the number of misclassifications on the training 
dataset. In the case of representation learning, our objective 
is far removed from the ultimate objective, which typically 
is learning a classifier or some other predictor. Our 
problem is reminiscent of the credit assignment problem 
encountered in reinforcement learning. We have proposed 
that a good representation is one that disentangles the 
underlying factors of variation, but how do we translate 
that into appropriate training criteria? Is it even necessary 
to do anything but maximize likelihood under a good 
model or can we introduce priors such as those enumer¬ 
ated above (possibly data-dependent ones) that help the 
representation better do this disentangling? This question 
remains clearly open but is discussed in more detail in 
Sections 3.5 and 11.4. 

4 Building Deep Representations 

In 2006, a breakthrough in feature learning and deep 
learning was initiated by Geoff Flinton and quickly 
followed up in the same year [94], [23], [161], and soon 
after by Lee et al. [134] and many more later. It has been 
extensively reviewed and discussed in [11]. A central idea, 
referred to as greedy layerwise unsupervised pretraining, was to 
learn a hierarchy of features one level at a time, using 
unsupervised feature learning to learn a new transforma¬ 
tion at each level to be composed with the previously 
learned transformations; essentially, each iteration of 
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unsupervised feature learning adds one layer of weights to 
a deep neural network. Finally, the set of layers could be 
combined to initialize a deep supervised predictor, such as 
a neural network classifier, or a deep generative model, 
such as a deep Boltzmann machine (DBM) [176]. 

This paper is mostly about feature learning algorithms 
that can be used to form deep architectures. In particular, it 
was empirically observed that layerwise stacking of feature 
extraction often yielded better representations, for example, 
in terms of classification error [119], [65], quality of the 
samples generated by a probabilistic model [176], or in 
terms of the invariance properties of the learned features 
[71]. Whereas this section focuses on the idea of stacking 
single-layer models, Section 10 follows up with a discussion 
on joint training of all the layers. 

After greedy layerwise unsupervised pretraining, the 
resulting deep features can be used either as input to a 
standard supervised machine learning predictor (such as an 
SVM) or as initialization for a deep supervised neural 
network (e.g., by appending a logistic regression layer or 
purely supervised layers of a multilayer neural network). 
The layerwise procedure can also be applied in a purely 
supervised setting, called the greedy layerwise supervised 
pretraining [23]. For example, after the first one-hidden- 
layer multilayer perceptrons (MLP) is trained, its output 
layer is discarded and another one-hidden-layer MLP can 
be stacked on top of it, and so on. Although results reported 
in [23] were not as good as for unsupervised pretraining, 
they were nonetheless better than without pretraining at all. 
Alternatively, the outputs of the previous layer can be fed as 
extra inputs for the next layer (in addition to the raw input), 
as successfully done in [221]. Another variant [184] 
pretrains in a supervised way all the previously added 
layers at each step of the iteration, and in their experiments, 
this discriminant variant yielded better results than un¬ 
supervised pretraining. 

Whereas combining single layers into a supervised 
model is straightforward, it is less clear how layers 
pretrained by unsupervised learning should be combined 
to form a better unsupervised model. We cover here some of 
the approaches to do so, but no clear winner emerges, and 
much work has to be done to validate existing proposals or 
improve them. 

The first proposal was to stack pretrained RBMs into a 
deep belief network [94] or DBN, where the top layer is 
interpreted as an RBM and the lower layers as a directed 
sigmoid belief network. Flowever, it is not clear how to 
approximate maximum likelihood training to further 
optimize this generative model. One option is the wake- 
sleep algorithm [94], but more work should be done to 
assess the efficiency of this procedure in terms of improving 
the generative model. 

The second approach that has been put forward is to 
combine the RBM parameters into a DBM by basically 
halving the RBM weights to obtain the DBM weights [176]. 
The DBM can then be trained by approximate maximum 
likelihood as discussed in more detail later (Section 10.2). 
This joint training has brought substantial improvements, 
both in terms of likelihood and in terms of classification 
performance of the resulting deep feature learner [176]. 


Another early approach was to stack RBMs or autoenco¬ 
ders into a deep autoencoder [92], If we have a series of encoder- 
decoder pairs (/W(-), f/W]-)), then the overall encoder is the 
composition of the encoders, f ^ N \... /( 2 ) (/l 1 ) (•))), and the 
overall decoder is its "transpose" (often with transposed 
weight matrices as well), •• (•))). The deep 

autoencoder (or its regularized version, as discussed in 
Section 7.2) can then be jointly trained, with all the 
parameters optimized with respect to a global reconstruction 
error criterion. More work on this avenue clearly needs to be 
done, and it was probably avoided by fear of the challenges in 
training deep feedforward networks, discussed in Section 10 
along with very encouraging recent results. 

Yet another recently proposed approach to training deep 
architectures [154] is to consider the iterative construction of 
a free energy function (i.e., with no explicit latent variables, 
except possibly for a top-level layer of hidden units) for a 
deep architecture as the composition of transformations 
associated with lower layers, followed by top-level hidden 
units. The question is then how to train a model defined by 
an arbitrary parameterized (free) energy function. Ngiam 
et al. [154] have used hybrid Monte Carlo [153], but other 
options include contrastive divergence (CD) [88], [94], score 
matching [97], [99], denoising score matching [112], [210], 
ratio matching [98], and noise-contrastive estimation [80]. 

5 Single-Layer Learning Modules 

Within the community of researchers interested in repre¬ 
sentation learning, there has developed two broad parallel 
lines of inquiry: one rooted in probabilistic graphical 
models and one rooted in neural networks. Fundamentally, 
the difference between these two paradigms is whether the 
layered architecture of a deep learning model is to be 
interpreted as describing a probabilistic graphical model or 
as describing a computation graph. In short, are hidden 
units considered latent random variables or as computa¬ 
tional nodes? 

To date, the dichotomy between these two paradigms 
has remained in the background, perhaps because they 
appear to have more characteristics in common than 
separating them. We suggest that this is likely a function 
of the fact that much recent progress in both of these areas 
has focused on single-layer greedy learning modules and the 
similarities between the types of single-layer models that 
have been explored: mainly, the RBM on the probabilistic 
side and the autoencoder variants on the neural network 
side. Indeed, as shown by one of us [210] and others [200], 
in the case of the RBM, training the model via an inductive 
principle known as score matching [97] (to be discussed in 
Section 6.4.3) is essentially identical to applying a regular¬ 
ized reconstruction objective to an autoencoder. Another 
strong link between pairs of models on both sides of this 
divide is when the computational graph for computing 
representation in the neural network model corresponds 
exactly to the computational graph that corresponds to 
inference in the probabilistic model, and this happens to 
also correspond to the structure of graphical model itself 
(e.g., as in the RBM). 

The connection between these two paradigms becomes 
more tenuous when we consider deeper models, where in 
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the case of a probabilistic model, exact inference typically 
becomes intractable. In the case of deep models, the 
computational graph diverges from the structure of the 
model. For example, in the case of a DBM, unrolling 
variational (approximate) inference into a computational 
graph results in a recurrent graph structure. We have 
performed preliminary exploration [179] of deterministic 
variants of deep autoencoders whose computational graph 
is similar to that of a DBM (in fact very close to the mean- 
field variational approximations associated with the 
Boltzmann machine), and that is one interesting inter¬ 
mediate point to explore (between the deterministic 
approaches and the graphical model approaches). 

In the next few sections, we will review the major 
developments in single-layer training modules used to 
support feature learning and particularly deep learning. We 
divide these sections between (Section 6) the probabilistic 
models, with inference and training schemes that directly 
parameterize the generative—or decoding —pathway and 
(Section 7) the typically neural network-based models that 
directly parametrize the encoding pathway. Interestingly, 
some models like predictive sparse decomposition (PSD) 
[109] inherit both properties and will also be discussed 
(Section 7.2.4). We then present a different view of 
representation learning, based on the associated geometry 
and the manifold assumption, in Section 8. 

First, let us consider an unsupervised single-layer 
representation learning algorithm spanning all three views: 
probabilistic, autoencoder, and manifold learning. 

5.1 PCA 

We will use probably the oldest feature extraction algo¬ 
rithm, PCA, to illustrate the probabilistic, autoencoder, and 
manifold views of representation learning. PCA learns a 
linear transformation h = fix) = W T x + b of input x £ ~K\r x , 
where the columns of d x x d/, matrix W form an orthogonal 
basis for the dh orthogonal directions of greatest variance in 
the training data. The result is dh features (the components 
of representation h) that are decorrelated. The three 
interpretations of PCA are the following: 1) It is related to 
probabilistic models (Section 6) such as probabilistic PCA, 
factor analysis, and the traditional multivariate Gaussian 
distribution (the leading eigenvectors of the covariance 
matrix are the principal components); 2) the representation 
it learns is essentially the same as that learned by a basic 
linear autoencoder (Section 7.2); and 3) it can be viewed as a 
simple linear form of linear manifold learning (Section 8), i.e., 
characterizing a lower dimensional region in input space 
near which the data density is peaked. Thus, PCA may be in 
the back of the reader's mind as a common thread relating 
these various viewpoints. Unfortunately, the expressive 
power of linear features is very limited: They cannot be 
stacked to form deeper, more abstract representations since 
the composition of linear operations yields another linear 
operation. Here, we focus on recent algorithms that have 
been developed to extract nonlinear features, which can be 
stacked in the construction of deep networks, although 
some authors simply insert a nonlinearity between learned 
single-layer linear projections [125], [43]. 

Another rich family of feature extraction techniques that 
this review does not cover in any detail due to space 


constraints is independent component analysis or ICA 
[108], [8]. Instead, we refer the reader to [101], [103]. Note 
that while in the simplest case (complete, noise free) ICA 
yields linear features, in the more general case it can be 
equated with a linear generative model with non-Gaussian 
independent latent variables, similar to sparse coding 
(Section 6.1.1), which result in nonlinear features. Therefore, 
ICA and its variants like independent and topographic ICA 
[102] can and have been used to build deep networks [122], 
[125] (see Section 11.2). The notion of obtaining indepen¬ 
dent components also appears similar to our stated goal of 
disentangling underlying explanatory factors through deep 
networks. However, for complex real-world distributions, it 
is doubtful that the relationship between truly independent 
underlying factors and the observed high-dimensional data 
can be adequately characterized by a linear transformation. 

6 Probabilistic Models 

From the probabilistic modeling perspective, the question 
of feature learning can be interpreted as an attempt to 
recover a parsimonious set of latent random variables that 
describe a distribution over the observed data. We can 
express as p(x, h) a probabilistic model over the joint space 
of the latent variables, h, and observed data or visible 
variables x. Feature values are conceived as the result of an 
inference process to determine the probability distribution 
of the latent variables given the data, i.e., p(h \ x), often 
referred to as the posterior probability. Learning is conceived 
in terms of estimating a set of model parameters that 
(locally) maximize the regularized likelihood of the training 
data. The probabilistic graphical model formalism gives us 
two possible modeling paradigms in which we can consider 
the question of inferring latent variables, directed and 
undirected graphical models, which differ in their para¬ 
meterization of the joint distribution p(x, h), yielding major 
impact on the nature and computational costs of both 
inference and learning. 

6.1 Directed Graphical Models 

Directed latent factor models separately parameterize the 
conditional likelihood p(x \ h) and the prior p(h) to construct 
the joint distribution, p(x, h) = p(x \ h)p(h). Examples of 
this decomposition include: PCA [171], [206], sparse coding 
[155], sigmoid belief networks [152], and the newly 
introduced spike-and-slab sparse coding (S3C) model [72]. 

6.1.1 Explaining Away 

Directed models often lead to one important property: 
explaining away , i.e., a priori independent causes of an event 
can become nonindependent given the observation of the 
event. Latent factor models can generally be interpreted as 
latent cause models, where the h activations cause the 
observed x. This renders the a priori independent h to be 
nonindependent. As a consequence, recovering the posterior 
distribution of h,p(h \ x) (which we use as a basis for feature 
representation), is often computationally challenging and 
can be entirely intractable, especially when h is discrete. 

A classic example that illustrates the phenomenon is to 
imagine you are on vacation away from home and you 
receive a phone call from the security system company 
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telling you that the alarm has been activated. You begin 
worrying your home has been burglarized, but then you 
hear on the radio that a minor earthquake has been reported 
in the area of your home. If you happen to know from prior 
experience that earthquakes sometimes cause your home 
alarm system to activate, then suddenly you relax, con¬ 
fident that your home has very likely not been burglarized. 

The example illustrates how the alarm activation rendered 
two otherwise entirely independent causes, burglarized and 
earthquake, to become dependent—in this case, the depen¬ 
dency is one of mutual exclusivity. Since both burglarized 
and earthquake are very rare events and both can cause alarm 
activation, the observation of one explains away the other. 
Despite the computational obstacles we face when attempt¬ 
ing to recover the posterior over h, explaining away 
promises to provide a parsimonious p(h \ x), which can be 
an extremely useful characteristic of a feature encoding 
scheme. If one thinks of a representation as being composed 
of various feature detectors and estimated attributes of the 
observed input, it is useful to allow the different features to 
compete and collaborate with each other to explain the 
input. This is naturally achieved with directed graphical 
models, but can also be achieved with undirected models 
(see Section 6.2) such as Boltzmann machines if there are 
lateral connections between the corresponding units or 
corresponding interaction terms in the energy function that 
defines the probability model. 

Probabilistic interpretation of PC A. PC A can be given a 
natural probabilistic interpretation [171], [206] as factor 
analysis: 

p(h)=Af(h; 0,a£l), 
p(x | h) = Af(x\ Wh + fji x , all), 

where x E M'S h E lR d '‘, Af(v, p, E) is the multivariate 
normal density of v with mean p and covariance E, and 
columns of W span the same space as leading <4 principal 
components, but are not constrained to be orthonormal. 

Sparse coding. Like PCA, sparse coding has both a 
probabilistic and nonprobabilistic interpretation. Sparse 
coding also relates a latent representation h (either a vector 
of random variables or a feature vector, depending on the 
interpretation) to the data x through a linear mapping W, 
which we refer to as the dictionary. The difference between 
sparse coding and PCA is that sparse coding includes a 
penalty to ensure a sparse activation of h is used to encode 
each input x. From a nonprobabilistic perspective, sparse 
coding can be seen as recovering the code or feature vector 
associated with a new input x via 

h* = f(x) = argmin \\x - Wh\\\ + A \\h\h- (2) 

h 

Learning the dictionary W can be accomplished by optimiz¬ 
ing the following training criterion with respect to W: 

J S c = J2\\xU-Wh*W\\l (3) 

t 

where x® is the fth example and h is the corresponding 
sparse code determined by (2). W is usually constrained to 
have unit-norm columns (because one can arbitrarily 


exchange scaling of column i with scaling of hf\ such a 
constraint is necessary for the LI penalty to have any effect). 

The probabilistic interpretation of sparse coding differs 
from that of PCA in that instead of a Gaussian prior on the 
latent random variable h, we use a sparsity inducing 
Laplace prior (corresponding to an LI penalty): 

P{h) = n^ exp (-A|/i,|), 

i \ / 

p(x | h) = Af(x\ Wh + p x , off). 

In the case of sparse coding, because we will seek a sparse 
representation (i.e., one with many features set to exactly 
zero), we will be interested in recovering the maximum 
a posteriori value (MAP value of h, i.e., h* = argmax ft p(/i | 
x) rather than its expected value IE[/i | x]). Under this 
interpretation, dictionary learning proceeds as maximizing 
the likelihood of the data given these MAP values of h*: 
argmax[,(, ]j| ( p(x <L ' > \ h *W) subject to the norm constraint on 
W. Note that this parameter learning scheme, subject to the 
MAP values of the latent h, is not standard practice in 
the probabilistic graphical model literature. Typically, the 
likelihood of the data p(x) = Jf h p(x \ h)p(h) is maximized 
directly. In the presence of latent variables, expectation 
maximization is employed, where the parameters are 
optimized with respect to the marginal likelihood, i.e., 
summing or integrating the joint log likelihood over all 
values of the latent variables under their posterior P(h \ x), 
rather than considering only the single MAP value of h. 
The theoretical properties of this form of parameter 
learning are not yet well understood but seem to work 
well in practice (e.g., k-means versus Gaussian mixture 
models and Viterbi training for HMMs). Note also that the 
interpretation of sparse coding as a MAP estimation can 
be questioned [77] because, even though the interpretation 
of the LI penalty as a log prior is a possible interpretation, 
there can be other Bayesian interpretations compatible with 
the training criterion. 

Sparse coding is an excellent example of the power of 
explaining away. Even with a very overcomplete diction¬ 
ary, 11 the MAP inference process used in sparse coding to 
find h* can pick out the most appropriate bases and zero 
the others, despite them having a high degree of correla¬ 
tion with the input. This property arises naturally in 
directed graphical models such as sparse coding and is 
entirely due to the explaining away effect. It is not seen in 
commonly used undirected probabilistic models such as 
the RBM, nor is it seen in parametric feature encoding 
methods such as autoencoders. The tradeoff is that 
compared to methods such as RBMs and autoencoders, 
inference in sparse coding involves an extra inner loop of 
optimization to find h* with a corresponding increase in 
the computational cost of feature extraction. Compared to 
autoencoders and RBMs, the code in sparse coding is a free 
variable for each example, and in that sense, the implicit 
encoder is nonparametric. 

One might expect that the parsimony of the sparse 
coding representation and its explaining away effect would 
be advantageous and indeed it seems to be the case. Coates 

11. Overcomplete: with more dimensions of h than dimensions of x. 
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and Ng [48] demonstrated on the CIFAR-10 object 
classification task [116] with a patch-base feature extraction 
pipeline that in the regime with few (<1,000) labeled 
training examples per class, the sparse coding representa¬ 
tion significantly outperformed other highly competitive 
encoding schemes. Possibly because of these properties and 
because of the very computationally efficient algorithms 
that have been proposed for it (in comparison with the 
general case of inference in the presence of explaining 
away), sparse coding enjoys considerable popularity as a 
feature learning and encoding paradigm. There are numer¬ 
ous examples of its successful application as a feature 
representation scheme, including natural image modeling 
[159], [109], [48], [224], audio classification [78], NLP [4], as 
well as being a very successful model of the early visual 
cortex [155]. Sparsity criteria can also be generalized 
successfully to yield groups of features that prefer to all 
be zero, but if one or a few of them are active, then the 
penalty for activating others in the group is small. Different 
groiLp sparsity patterns can incorporate different forms of 
prior knowledge [110], [107], [3], [76]. 

S3C. S3C is one example of a promising variation on 
sparse coding for feature learning [73]. The S3C model 
possesses a set of latent binary spike variables together with 
a set of latent real-valued slab variables. The activation of 
the spike variables dictates the sparsity pattern. S3C has 
been applied to the CIFAR-10 and CIFAR-100 object 
classification tasks [116], and shows the same pattern as 
sparse coding of superior performance in the regime of 
relatively few (< 1,000) labeled examples per class [73]. In 
fact, in both the CIFAR-100 dataset (with 500 examples per 
class) and the CIFAR-10 dataset (when the number of 
examples is reduced to a similar range), the S3C representa¬ 
tion actually outperforms sparse coding representations. 
This advantage was revealed clearly with S3C winning the 
NIPS '11 Transfer Learning Challenge [72], 

6.2 Undirected Graphical Models 

Undirected graphical models, also called Markov random 
fields (MRFs), parameterize the joint p(x, h) through a 
product of unnormalized nonnegative clique potentials: 

p(x,h) =^rY[ipi(x)Y[T) j (h)Y[u k (x,h), (5) 

Zj0 i j k 

where tpi(x), r]j(h), and Vf.(x,h) are the clique potentials 
describing the interactions between the visible elements, 
between the hidden variables, and those interaction 
between the visible and hidden variables, respectively. 
The partition function Zg ensures that the distribution is 
normalized. Within the context of unsupervised feature 
learning, we generally see a particular form of MRF called a 
Boltzmann distribution with clique potentials constrained 
to be positive: 

p{x, h) = —exp{—£${x, h)), (6) 

Zg 

where £g(x,h) is the energy function and contains the 
interactions described by the MRF clique potentials and 0 are 
the model parameters that characterize these interactions. 

The Boltzmann machine was originally defined as a 
network of symmetrically coupled binary random variables 


or units. These stochastic units can be divided into two 
groups: 1) the visible units x e {0, l} d * that represent the 
data, and 2) the hidden or latent units h 6 {0, l} dk that 
mediate dependencies between the visible units through 
their mutual interactions. The pattern of interaction is 
specified through the energy function: 

£® M (x,h) = —-x T Ux — -h T Vh. — x T Wh — b T x — d T h, (7) 

where 9 = {U, V, W, b, d} are the model parameters that, 
respectively, encode the visible-to-visible interactions, the 
hidden-to-hidden interactions, the visible-to-hidden inter¬ 
actions, the visible self-connections, and the hidden self¬ 
connections (called biases). To avoid overparameterization, 
the diagonals of U and V are set to zero. 

The Boltzmann machine energy function specifies the 
probability distribution over [x, h], via the Boltzmann 
distribution, (6), with the partition function Zg given by 

Xl=l h d,i ~ I 

exp ( — £^ m (x, h\9)'). (8) 

£1=0 0 h\ —0 hd h = 0 

This joint probability distribution gives rise to the set of 
conditional distributions of the form: 

P{hi | x, h\i ) = sigmoid ( Y WjiXj + Y 14 'he + d t j , (9) 

' 3 i'^i ' 

P(xj | h, x\j) = sigmoid ( Y WjiXj + Y UjfXj, + bj ]. (10) 

In general, inference in the Boltzmann machine is intract¬ 
able. For example, computing the conditional probability of 
hi given the visibles, P(hi \ x), requires marginalizing over 
the rest of the hiddens, which implies evaluating a sum 
with 2 dh ~ 1 terms: 

fti=l ft i-i= l hi+ i=l h d h = 1 

p (hi \ X ) = Y"- ■" p ( h i *)■ ( n ) 

hi—0 hi-i—Ohi+i—0 hd h = 0 

However, with some judicious choices in the pattern of 
interactions between the visible and hidden units, more 
tractable subsets of the model family are possible, as we 
discuss next. 

RBMs. The RBM is likely the most popular subclass of 
Boltzmann machine [190]. It is defined by restricting the 
interactions in the Boltzmann energy function, in (7), to only 
those between h and x, i.e., £^ BM is £f M with U = 0 and 
V = 0. As such, the RBM can be said to form a bipartite 
graph with the visibles and the hiddens forming two layers 
of vertices in the graph (and no connection between units of 
the same layer). With this restriction, the RBM possesses the 
useful property that the conditional distribution over the 
hidden units factorizes given the visibles: 

P(h | x) = Y[P(h l | x), 

\ ( 12 ) 

P(hi = 1 | x) = sigmoid! \ 1 VjiXj + dq I. 
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Likewise, the conditional distribution over the visible units 
given the hiddens also factorizes: 


P{x | h)=Y[P(xj | h), 


P{Xj = 1 | h) = sigmoid ^ ^ Wjihi + b^j . 


(13) 


This makes inferences readily tractable in RBMs. For 
example, the RBM feature representation is taken to be 
the set of posterior marginals P{hi \ x) which, given the 
conditional independence described in (12), are immedi¬ 
ately available. Note that this is in stark contrast to the 
situation with popular directed graphical models for 
unsupervised feature extraction, where computing the 
posterior probability is intractable. 

Importantly, the tractability of the RBM does not extend 
to its partition function, which still involves summing an 
exponential number of terms. It does imply, however, that 
we can limit the number of terms to min{2 <ia: , 2 dh }. Usually, 
this is still an unmanageable number of terms, and therefore 
we must resort to approximate methods to deal with its 
estimation. 

It is difficult to overstate the impact the RBM has had to 
the fields of unsupervised feature learning and deep 
learning. It has been used in a truly impressive variety of 
applications, including fMRI image classification [180], 
motion and spatial transformations [201], [144], collabora¬ 
tive filtering [178], and natural image modeling [160], [53]. 

6.3 Generalizations of the RBM to Real-Valued Data 

Important progress has been made in the last few years in 
defining generalizations of the RBM that better capture real¬ 
valued data, in particular real-valued image data, by better 
modeling the conditional covariance of the input pixels. The 
standard RBM, as discussed above, is defined with both 
binary visible variables v G {0,1} and binary latent vari¬ 
ables h € {0,1}. The tractability of inference and learning in 
the RBM has inspired many authors to extend it, via 
modifications of its energy function, to model other kinds of 
data distributions. In particular, there have been multiple 
attempts to develop RBM-type models of real-valued data, 
where x € lR dl . The most straightforward approach to 
modeling real-valued observations within the RBM frame¬ 
work is the so-called Gaussian RBM (GRBM), where the only 
change in the RBM energy function is to the visible units 
biases, by adding a bias term that is quadratic in the visible 
units x. While it probably remains the most popular way to 
model real-valued data within the RBM framework, 
Ranzato and Hinton [160] suggest that the GRBM has 
proven to be a somewhat unsatisfactory model of natural 
images. The trained features typically do not represent 
sharp edges that occur at object boundaries and lead to 
latent representations that are not particularly useful 
features for classification tasks. Ranzato and Hinton [160] 
argue that the failure of the GRBM to adequately capture 
the statistical structure of natural images stems from the 
exclusive use of the model capacity to capture the 
conditional mean at the expense of the conditional 
covariance. Natural images, they argue, are chiefly char¬ 
acterized by the covariance of the pixel values, not by their 


absolute values. This point is supported by the common use 
of preprocessing methods that standardize the global 
scaling of the pixel values across images in a dataset or 
across the pixel values within each image. 

These kinds of concerns about the ability of the GRBM 
to model natural image data has led to the development of 
alternative RBM-based models that each attempt to take on 
this objective of better modeling nondiagonal conditional 
covariances. Ranzato and Hinton [160] introduced the mean 
and covariance RBM (mcRBM). Like the GRBM, the mcRBM 
is a two-layer Boltzmann machine that explicitly models 
the visible units as Gaussian distributed quantities. How¬ 
ever, unlike the GRBM, the mcRBM uses its hidden layer to 
independently parametrize both the mean and covariance 
of the data through two sets of hidden units. The mcRBM 
is a combination of the covariance RBM (cRBM) [163], 
which models the conditional covariance, and the GRBM, 
which captures the conditional mean. While the GRBM has 
shown considerable potential as the basis of a highly 
successful phoneme recognition system [54], it seems that 
due to difficulties in training the mcRBM, the model has 
been largely superseded by the mPoT model. The mPoT 
model ( mean-product of Student's T-distributions model) [164] 
is a combination of the GRBM and the product of Student's 
T-distributions model [216]. It is an energy-based model 
where the conditional distribution over the visible units 
conditioned on the hidden variables is a multivariate 
Gaussian (nondiagonal covariance) and the complementary 
conditional distribution over the hidden variables given the 
visibles are a set of independent Gamma distributions. 

The PoT model has recently been generalized to the 
mPoT model [164] to include nonzero Gaussian means by 
the addition of GRBM-like hidden units, similarly to how 
the mcRBM generalizes the cRBM. The mPoT model has 
been used to synthesize large-scale natural images [164] that 
show large-scale features and shadowing structure. It has 
been used to model natural textures [113] in a tiled- 
convolution configuration (see Section 11.2). 

Another recently introduced RBM-based model with the 
objective of having the hidden units encode both the mean 
and covariance information is the spike-and-slab restricted 
Boltzmann machine (ssRBM) [52], [53]. The ssRBM is 
defined as having both a real-valued "slab" variable and a 
binary "spike" variable associated with each unit in the 
hidden layer. The ssRBM has been demonstrated as a 
feature learning and extraction scheme in the context of 
CIFAR-10 object classification [116] from natural images 
and has performed well in the role [52], [53]. When trained 
convolutionally (see Section 11.2) on full CIFAR-10 natural 
images, the model demonstrated the ability to generate 
natural image samples that seem to capture the broad 
statistical structure of natural images better than previous 
parametric generative models, as illustrated with the 
samples of Fig. 2. 

The mcRBM, mPoT, and ssRBM each set out to model 
real-valued data such that the hidden units encode not only 
the conditional mean of the data but also its conditional 
covariance. Other than differences in the training schemes, 
the most significant difference between these models is how 
they encode their conditional covariance. While the mcRBM 
and the mPoT use the activation of the hidden units to 
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Fig. 2. (Top) Samples from convolutionally trained /i-ssRBM from 
[53]. (Bottom) Images in the CIFAR-10 training set closest 
(L2 distance with contrast normalized training images) to correspond¬ 
ing model samples on top. The model does not appear to be 
overfitting particular training examples. 

enforce constraints on the covariance of x, the ssRBM uses 
the hidden unit to pinch the precision matrix along the 
direction specified by the corresponding weight vector. 
These two ways of modeling conditional covariance 
diverge when the dimensionality of the hidden layer is 
significantly different from that of the input. In the 
overcomplete setting, sparse activation with the ssRBM 
parametrization permits variance only in the select direc¬ 
tions of the sparsely activated hidden units. This is a 
property the ssRBM shares with sparse coding models 
[155], [78]. On the other hand, in the case of the mPoT or 
mcRBM, an overcomplete set of constraints on the covar¬ 
iance implies that capturing arbitrary covariance along a 
particular direction of the input requires decreasing 
potentially all constraints with positive projection in that 
direction. This perspective would suggest that the mPoT 
and mcRBM do not appear to be well suited to provide a 
sparse representation in the overcomplete setting. 

6.4 RBM Parameter Estimation 

Many of the RBM training methods we discuss here are 
applicable to more general undirected graphical models, 
but are particularly practical in the RBM setting. Freund 
and Haussler [66] proposed a learning algorithm for 
harmoniums (RBMs) based on projection pursuit. CD [88], 
[94] has been used most often to train RBMs, and many 
recent papers use stochastic maximum likelihood (SML) 
[220], [204], 

As discussed in Section 6.1, in training probabilistic 
models, parameters are typically adapted to maximize the 
likelihood of the training data (or equivalently the log 
likelihood, or its penalized version, which adds a regular¬ 
ization term). With T training examples, the log likelihood 
is given by 


T T _ 

5>g P{x {t) -,e) = J2 log E p{ x ^\h-e). (14) 

t=1 i=1 Aefo.l}'''- 

Gradient-based optimization requires its gradient, which 
for Boltzmann machines is given by 
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where we have the expectations with respect to p(h^ | x®) 
under the "clamped" condition (also called the positive 
phase) and over the full joint p(x, h) under the "unclamped" 
condition (also called the negative phase). Intuitively, the 
gradient acts to locally move the model distribution (the 
negative phase distribution) toward the data distribution 
(positive phase distribution) by pushing down the energy of 
(h,x^l) pairs (for h ~ P(h \ x®)) while pushing up the 
energy of (h, x) pairs (for (h, x) ~ P(h, x)) until the two 
forces are in equilibrium, at which point the sufficient 
statistics (gradient of the energy function) have equal 
expectations with x sampled from the training distribution 
or with x sampled from the model. 

The RBM conditional independence properties imply 
that the expectation in the positive phase of (15) is 
tractable. The negative phase term—arising from the 
partition function's contribution to the log-likelihood 
gradient—is more problematic because the computation 
of the expectation over the joint is not tractable. The 
various ways of dealing with the partition function's 
contribution to the gradient have brought about a number 
of different training algorithms, many trying to approx¬ 
imate the log-likelihood gradient. 

To approximate the expectation of the joint distribution 
in the negative phase contribution to the gradient, it is 
natural to again consider exploiting the conditional in¬ 
dependence of the RBM to specify a Monte Carlo approx¬ 
imation of the expectation over the joint: 





RBM 

e 


{x,h) 





(16) 


with the samples (x^,h^) drawn by a block Gibbs Markov 
chain Monte Carlo (MCMC) sampling procedure: 

x w ~ P (x | h^), 

h® ~ P(h | x m ). 


Naively, for each gradient update step one would start a 
Gibbs sampling chain, wait until the chain converges to the 
equilibrium distribution, and then draw a sufficient number 
of samples to approximate the expected gradient with 
respect to the model (joint) distribution in (16). Then, restart 
the process for the next step of approximate gradient ascent 
on the log-likelihood. This procedure has the obvious flaw 
that waiting for the Gibbs chain to "burn-in" and reach 
equilibrium anew for each gradient update cannot form the 
basis of a practical training algorithm. CD [88], [94], SML 
[220], [204], and fast-weights persistent CD or fast-weights 






































































BENGIO ET AL.: REPRESENTATION LEARNING: A REVIEW AND NEW PERSPECTIVES 


1809 


persistent contrastive divergence (FPCD) [205] are all ways 
to avoid or reduce the need for burn in. 

6.4.1 CD 

CD estimation [88], [94] estimates the negative phase 
expectation (15) with a very short Gibbs chain (often just 
one step) initialized at the training data used in the positive 
phase. This reduces the variance of the gradient estimator 
and still moves in a direction that pulls the negative chain 
samples toward the associated positive chain samples. 
Much has been written about the properties and alternative 
interpretations of CD and its similarity to autoencoder 
training, for example, [42], [225], [14], [197]. 

6.4.2 SML 

The SML algorithm (also known as persistent contrastive 
divergence or PCD) [220], [204] is an alternative way to 
sidestep an extended burn in of the negative phase Gibbs 
sampler. At each gradient update, rather than initializing 
the Gibbs chain at the positive phase sample as in CD, SML 
initializes the chain at the last state of the chain used for the 
previous update. In other words, SML uses a continually 
running Gibbs chain (or often a number of Gibbs chains 
run in parallel) from which samples are drawn to estimate 
the negative phase expectation. Despite the model para¬ 
meters changing between updates, these changes should be 
small enough that only a few steps of Gibbs (in practice, 
often one step is used) are required to maintain samples 
from the equilibrium distribution of the Gibbs chain, i.e., 
the model distribution. 

A troublesome aspect of SML is that it relies on the Gibbs 
chain to mix well (especially between modes) for learning to 
succeed. Typically, as learning progresses and the weights of 
the RBM grow, the ergodicity of the Gibbs sample begins to 
break down. 12 If the learning rate e associated with gradient 
ascent 9 <— 9 + eg (with E[g\ k, dlo &p4 x ) ) no t re d ucec l to 
compensate, then the Gibbs sampler will diverge from the 
model distribution and learning will fail. Desjardins et al. 
[58], Cho et al. [44], and Salakhutdinov [173], [174] have all 
considered various forms of tempered transitions to address 
the failure of Gibbs chain mixing, and convincing solutions 
have not yet been clearly demonstrated. A recently intro¬ 
duced promising avenue relies on depth itself, relying on the 
observation that mixing between modes is much easier on 
deeper layers [27] (Section 9.4). 

Tieleman and Hinton [205] have proposed quite a 
different approach to addressing potential mixing problems 
of SML with their FPCD, and it has also been exploited to 
train DBMs [173] and construct a pure sampling algorithm 
for RBMs [39]. FPCD builds on the surprising but robust 
tendency of Gibbs chains to mix better during SML 
learning than when the model parameters are fixed. The 
phenomenon is rooted in the form of the likelihood 
gradient itself (15). The samples drawn from the SML 
Gibbs chain are used in the negative phase of the gradient, 

12. When weights become large, the estimated distribution is more 
peaky, and the chain takes a very long time to mix, to move from mode to 
mode, so that practically the gradient estimator can be very poor. This is a 
serious chicken-and-egg problem because if sampling is not effective, then 
neither is the training procedure, which may seem to stall and yield even 
larger weights. 


which implies that the learning update will slightly 
increase the energy (decrease the probability) of those 
samples, making the region in the neighborhood of those 
samples less likely to be resampled and therefore making it 
more likely that the samples will move somewhere else 
(typically going near another mode). Rather than drawing 
samples from the distribution of the current model (with 
parameters 9), FPCD exaggerates this effect by drawing 
samples from a local perturbation of the model with 
parameters 9* and an update: 

e t +i = (1 -vA+i + rjO* + e *^(^logP( a:W ))’ ( 17 ) 

where e* is the relatively large fast-weight learning rate 
(e* > e) and 0 < p < 1 (but near 1) is a forgetting factor that 
keeps the perturbed model close to the current model. 
Unlike tempering, FPCD does not converge to the model 
distribution as e and e* go to 0, and further work is 
necessary to characterize the nature of its approximation to 
the model distribution. Nevertheless, FPCD is a popular 
and apparently effective means of drawing approximate 
samples from the model distribution that faithfully repre¬ 
sent its diversity at the price of sometimes generating 
spurious samples in between two modes (because the fast 
weights roughly correspond to a smoothed view of the 
current model's energy function). It has been applied in a 
variety of applications [205], [165], [113], and it has been 
transformed into a sampling algorithm [39] that also shares 
this fast mixing property with herding [215] for the same 
reason, i.e., introducing negative correlations between con¬ 
secutive samples of the chain to promote faster mixing. 

6.4.3 Pseudolikelihood, Ratio Matching and More 
While CD, SML, and FPCD are by far the most popular 
methods for training RBMs and RBM-based models, all of 
these methods are perhaps most naturally described as 
offering different approximations to maximum likelihood 
training. There exist other inductive principles that are 
alternatives to maximum likelihood that can also be used to 
train RBMs. In particular, these include pseudolikelihood 
[32] and ratio matching [98]. Both of these inductive 
principles attempt to avoid explicitly dealing with the 
partition function, and their asymptotic efficiency has been 
analyzed [140]. Pseudolikelihood seeks to maximize the 
product of all ID conditional distributions of the form 
P(xd | x\ d ), while ratio matching can be interpreted as an 
extension of score matching [97] to discrete data types. Both 
methods amount to weighted differences of the gradient of 
the RBM free energy 13 evaluated at a data point and at 
neighboring points. One potential drawback of these 
methods is that depending on the parameterization of the 
energy function, their computational requirements may 
scale up to 0(n d ) worse than CD, SML, FPCD, or demising 
score matching [112], [210], discussed below. Marlin et al. 
[141] empirically compared all of these methods (except 
denoising score matching) on a range of classification, 

13. The free energy T[x\ 9) is the energy associated with the data 
marginal probability, F(x\6) = —log P(x) — log Zq and is tractable for 
the RBM. 
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reconstruction, and density modeling tasks and found that, 
in general, SML provided the best combination of overall 
performance and computational tractability. However, in a 
later study, the same authors [200] found denoising score 
matching to be a competitive inductive principle both in 
terms of classification performance (with respect to SML) 
and in terms of computational efficiency (with respect to 
analytically obtained score matching). Denoising score 
matching is a special case of the denoising autoencoder 
(DAE) training criterion (Section 7.2.2) when the reconstruc¬ 
tion error residual equals a gradient, i.e., the score function 
associated with an energy function, as shown in [210]. 

In the spirit of the Boltzmann machine gradient (15), 
several approaches have been proposed to train energy- 
based models. One is noise-contrastive estimation [80], in 
which the training criterion is transformed into a probabil¬ 
istic classification problem: distinguish between (positive) 
training examples and (negative) noise samples generated 
by a broad distribution (such as the Gaussian). Another 
family of approaches, more in the spirit of CD, relies on 
distinguishing positive examples (of the training distribu¬ 
tion) and negative examples obtained by perturbations of 
the positive examples [50], [33], [218]. 

7 Directly Learning a Parametric Map from 
Input to Representation 

Within the framework of probabilistic models adopted in 
Section 6, the learned representation is always associated 
with latent variables, specifically with their posterior 
distribution given an observed input x. Unfortunately, this 
posterior distribution tends to become very complicated 
and intractable if the model has more than a couple of 
interconnected layers, whether in the directed or undir¬ 
ected graphical model frameworks. It then becomes 
necessary to resort to sampling or approximate inference 
techniques, and to pay the associated computational and 
approximation error price. If the true posterior has a large 
number of modes that matter, then current inference 
techniques may face an unsurmountable challenge or 
endure a potentially serious approximation. This is in 
addition to the difficulties raised by the intractable partition 
function in undirected graphical models. Moreover, 
a posterior distribution over latent variables is not yet a 
simple usable feature vector that can, for example, be fed to a 
classifier. So, actual feature values are typically derived from 
that distribution, taking the latent variable's expectation 
(as is typically done with RBMs), their marginal probability, 
or finding their most likely value (as in sparse coding). If 
we are to extract stable deterministic numerical feature 
values in the end anyway, an alternative (apparently) 
nonprobabilistic feature learning paradigm that focuses on 
carrying out this part of the computation very efficiently is 
that of autoencoders and other directly parameterized 
feature or representation functions. The commonality 
between these methods is that they learn a direct encoding, 
i.e., a parametric map from inputs to their representation. 

Regularized autoencoders, discussed next, also involve 
learning a decoding function that maps back from 
representation to input space. Sections 8.1 and 11.3 discuss 


direct encoding methods that do not require a decoder, 
such as semi-supervised embedding [217] and slow feature 
analysis [219]. 

7.1 Autoencoders 

In the autoencoder framework [129], [37], [93], one starts by 
explicitly defining a feature-extracting function in a specific 
parameterized closed form. This function, which we will 
denote fo, is called the encoder and will allow the 
straightforward and efficient computation of a feature 
vector h = fo(x) from an input x. For each example x® 
from a dataset we define 

h {t) = fe(xM), (18) 

where h (L ' 1 is the feature vector or representation or code 
computed from x^K Another closed-form parameterized 
function gg, called the decoder, maps from feature space back 
into input space, producing a reconstruction r = gg(h). 
Whereas probabilistic models are defined from an explicit 
probability function and are trained to maximize (often 
approximately) the data likelihood (or a proxy), autoenco¬ 
ders are parameterized through their encoder and decoder 
and are trained using a different training principle. The set 
of parameters 6 of the encoder and decoder are learned 
simultaneously on the task of reconstructing as well as 
possibly the original input, i.e., attempting to incur the 
lowest possible reconstruction error L{x , r)—a measure of the 
discrepancy between x and its reconstruction r —over 
training examples. Good generalization means low recon¬ 
struction error at test examples, while having high 
reconstruction error for most other x configurations. To 
capture the structure of the data-generating distribution, it 
is therefore important that something in the training 
criterion or the parameterization prevents the autoencoder 
from learning the identity function, which has zero 
reconstruction error everywhere. This is achieved through 
various means in the different forms of autoencoders, as 
described below in more detail, and we call these regidarized 
autoencoders. A particular form of regularization consists of 
constraining the code to have a low dimension, and this is 
what the classical autoencoder or PCA do. 

In summary, basic autoencoder training consists in 
finding a value of parameter vector 0 minimizing recon¬ 
struction error: 

Jae(0) = '52 L (x®,g e (Mx®))), (19) 

t 

where x® is a training example. This minimization is 
usually carried out by stochastic gradient descent as in the 
training of MLPs. Since autoencoders were primarily 
developed as MLPs, predicting their input, the most 
commonly used forms for the encoder and decoder are 
affine mappings, optionally followed by a nonlinearity: 


fg(x) = S f (b + Wx ), 

(20) 

ge(h) = s g (d + W'h), 

(21) 


where sj and s g are the encoder and decoder activation 
functions (typically, the elementwise sigmoid or hyperbolic 
tangent nonlinearity, or the identity function if staying 
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linear). The set of parameters of such a model is 
9 = {W, b, W', d}, where b and d are called encoder and 
decoder bias vectors and W and W' are the encoder and 
decoder weight matrices. 

The choice of s g and L depends largely on the input 
domain range and nature and is usually chosen so that L 
returns a negative log likelihood for the observed value of x. 
A natural choice for an unbounded domain is a linear 
decoder with a squared reconstruction error, i.e., s g (a) = a 
and L(x,r) = ||a; — r|| 2 . If inputs are bounded between 0 and 
1 , however, ensuring a similarly bounded reconstruction 
can be achieved by using s g = sigmoid. In addition, if the 
inputs are of a binary nature, a binary cross-entropy loss 14 is 
sometimes used. 

If both encoder and decoder use a sigmoid nonlinearity, 
then fe{x) and gg(h) have the exact same form as the 
conditionals P(h \ v) and P(v \ h) of binary RBMs (see 
Section 6.2). This similarity motivated an initial study [23] of 
the possibility of replacing RBMs with autoencoders as the 
basic pretraining strategy for building deep networks, as 
well as the comparative analysis of autoencoder reconstruc¬ 
tion error gradient and CD updates [14]. 

One notable difference in the parameterization is that 
RBMs use a single weight matrix, which follows naturally 
from their energy function, whereas the autoencoder 
framework allows for a different matrix in the encoder 
and decoder. In practice, however, weight tying in which one 
defines W' = W T may be (and is most often) used, 
rendering the parameterizations identical. The usual train¬ 
ing procedures, however, differ greatly between the two 
approaches. A practical advantage of training autoencoder 
variants is that they define a simple tractable optimization 
objective that can be used to monitor progress. 

In the case of a linear autoencoder (linear encoder and 
decoder) with squared reconstruction error, minimizing 
(19) learns the same subspace 15 as PCA. This is also true 
when using a sigmoid nonlinearity in the encoder [37], but 
not if the weights W and W' are tied (W' = W T ) because W 
cannot be forced into being small and W' large to achieve a 
linear encoder. 

Similarly, Le et al. [124] recently showed that adding a 
regularization term of the form ss(WjX^) to a linear 

autoencoder with tied weights, where S 3 is a nonlinear 
convex function, yields an efficient algorithm for learning 
linear ICA. 

7.2 Regularized Autoencoders 

Like PCA, autoencoders were originally seen as a dimen¬ 
sionality reduction technique and thus used a bottleneck, i.e., 
dh < d x . On the other hand, successful uses of sparse coding 
and RBM approaches tend to favor overcomplete representa¬ 
tions, i.e., dh > d x . This can allow the autoencoder to simply 
duplicate the input in the features with perfect reconstruc¬ 
tion without having extracted more meaningful features. 
Recent research has demonstrated very successful alter¬ 
native ways, called regularized autoencoders, to "constrain" 

14. L(x, r) = - x i log(n) + (1 — 73 ) log(l — Ti). 

15. Contrary to traditional PCA loading factors, but similarly to the 
parameters learned by probabilistic PCA, the weight vectors learned by a 
linear autoencoder are not constrained to form an orthonormal basis nor to 
have a meaningful ordering. They will, however, span the same subspace. 


the representation, even when it is overcomplete. The effect 
of a bottleneck or of this regularization is that the 
autoencoder cannot reconstruct everything well that it is 
trained to reconstruct well; the training examples and 
generalization mean that reconstruction error is also small 
on test examples. An interesting justification [162] for the 
sparsity penalty (or any penalty that restricts in a soft way 
the volume of hidden configurations easily accessible by the 
learner) is that it acts in spirit like the partition function of 
RBMs by making sure that only few input configurations 
can have a low reconstruction error. 

Alternatively, one can view the objective of the regular¬ 
ization applied to an autoencoder as making the represen¬ 
tation as "constant" (insensitive) as possible with respect to 
changes in input. This view immediately justifies two 
variants of regularized autoencoders described below: 
Contractive autoencoders (CAEs) reduce the number of 
effective degrees of freedom of the representation (around 
each point) by making the encoder contractive, i.e., making 
the derivative of the encoder small (thus making the 
hidden units saturate), while the DAE makes the whole 
mapping "robust," i.e., insensitive to small random 
perturbations, or contractive, making sure that the recon¬ 
struction cannot stay good when moving in most directions 
around a training example. 

7.2.1 Sparse Autoencoders 

The earliest uses of single-layer autoencoders for building 
deep architectures by stacking them [23] considered the 
idea of tying the encoder weights and decoder weights to 
restrict capacity as well as the idea of introducing a form of 
sparsity regularization [161]. Sparsity in the representation 
can be achieved by penalizing the hidden unit biases 
(making these additive offset parameters more negative) 
[161], [134], [71], [118] or by directly penalizing the output 
of the hidden unit activations (making them closer to their 
saturating value at 0) [162], [123], [227]. Penalizing the bias 
runs the danger that the weights could compensate for the 
bias, which could hurt numerical optimization. When 
directly penalizing the hidden unit outputs, several variants 
can be found in the literature, but a clear comparative 
analysis is still lacking. Although the LI penalty (i.e., simply 
the sum of output elements hj in the case of sigmoid 
nonlinearity) would seem the most natural (because of its 
use in sparse coding), it is used in few papers involving 
sparse autoencoders. A close cousin of the LI penalty is the 
Student-t penalty (log(l + h?)), originally proposed for 
sparse coding [155]. Several papers penalize the average 
output hj (e.g., over a minibatch), and instead of pushing it 
to 0 encourage it to approach a fixed target, either through a 
mean-square error penalty or maybe, more sensibly 
(because hj behaves like a probability), a Kullback-Liebler 
divergence with respect to the binomial distribution with 
probability p: — ploghj — (1 — p) log(l — hj) + constant, for 
example, with p = 0.05. 

7.2.2 DAEs 

Vincent et al. [212], [213] proposed altering the training 
objective in (19) from mere reconstruction to that of 
demising an artificially corrupted input, i.e., learning to 
reconstruct the clean input from a corrupted version. 
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Corrupted input ^ / 



Fig. 3. When data concentrate near a lower dimensional manifold, the 
corruption vector is typically almost orthogonal to the manifold, and the 
reconstruction function learns to denoise, map from low-probability 
configurations (corrupted inputs) to high-probability ones (original 
inputs), creating a vector field aligned with the score (derivative of the 
estimated density). 

Learning the identity is no longer enough: The learner must 
capture the structure of the input distribution to optimally 
undo the effect of the corruption process, with the 
reconstruction essentially being a nearby but higher density 
point than the corrupted input. Fig. 3 illustrates that the 
DAE is learning a reconstruction function that corresponds 
to a vector field pointing toward high-density regions 
(the manifold where examples concentrate). 

Formally, the objective optimized by a DAE is 

J dae = ^ fE Q (a|sW) [L(x (t \ gg(fe{x)))\ , (22) 

t 

where [-] averages over corrupted examples x 

drawn from corruption process q(x \ x ®). In practice, this 
is optimized by stochastic gradient descent, where the 
stochastic gradient is estimated by drawing one or a few 
corrupted versions of x® each time x® is considered. 
Corruptions considered in Vincent et al. [213] include 
additive isotropic Gaussian noise, salt and pepper noise for 
gray-scale images, and masking noise (salt or pepper only), 
for example, setting some randomly chosen inputs to 0 
(independently per example). Masking noise has been used 
in most of the simulations. Qualitatively better features are 
reported with denoising, resulting in improved classifica¬ 
tion, and DAE features performed similarly or better than 
RBM features. Chen et al. [43] show that a simpler 
alternative with a closed-form solution can be obtained 
when restricting to a linear autoencoder and have success¬ 
fully applied it to domain adaptation. 

Vincent [210] relates DAEs to energy-based probabilistic 
models: DAEs basically learn in r(x) — x a vector pointing 
in the direction of the estimated score (Fig. 3). In the 

special case of linear reconstruction and squared error, 
Vincent [210] shows that training an affine-sigmoid-affine 
DAE amounts to learning an energy-based model whose 
energy function is very close to that of a GRBM. Training 
uses a regularized variant of the score matching parameter 
estimation technique [97], [99], [112] termed denoising score 
matching [210]. Swersky [199] had shown that training 
GRBMs with score matching is equivalent to training a 
regular autoencoder with an additional regularization term, 
while follow up on the theoretical results in [ 210 ], [ 200 ] 
showed the practical advantage of denoising to implement 


score matching efficiently. Finally, Alain and Bengio [1] 
generalize [210] and prove that DAEs of arbitrary para¬ 
meterization with small Gaussian corruption noise are 
general estimators of the score. 

7.2.3 CAEs 

CAEs, proposed by Rifai et al. [167], follow up on DAEs and 
share a similar motivation of learning robust representa¬ 
tions. CAEs achieve this by adding an analytic contractive 
penalty to (19), the Frobenius norm of the encoder's 
Jacobian, and results in penalizing the sensitivity of learned 
features to infinitesimal input variations. Let J(x) = j^(x) 
be the Jacobian matrix of the encoder at x. The CAE's 
training objective is 

3 cae = 22 L (z (t) ,go(Mx W ))) + A||J(a: (t) )|ft, (23) 

t 

where A is a hyperparameter controlling the strength of the 
regularization. For an affine sigmoid encoder, the contrac¬ 
tive penalty term is easy to compute: 

M x ) = fe( x )j(l ~ 

\\A X )\\ 2 F = I> W - fo( x )j)) 2 \\ w jf- m 

j 

There are at least three notable differences with DAEs, 
which may be partly responsible for the better performance 
that CAE features seem to empirically demonstrate: 1) The 
sensitivity of the features is penalized 16 rather than that of 
the reconstruction. 2) The penalty is analytic rather than 
stochastic: An efficiently computable expression replaces 
what might otherwise require d x corrupted samples to size 
up (i.e., the sensitivity in d x directions). 3) A hyperpara¬ 
meter A allows fine control of the tradeoff between 
reconstruction and robustness (while the two are mingled 
in a DAE). Note, however, that there is a tight connection 
between the DAE and the CAE: As shown in [1], a DAE 
with small corruption noise can be seen (through a Taylor 
expansion) as a type of contractive autoencoder where the 
contractive penalty is on the whole reconstruction function 
rather than just on the encoder . 17 

A potential disadvantage of the CAE's analytic penalty is 
that it amounts to only encouraging robustness to infinitesimal 
input variations. This is remedied in [168] with the 
CAE+H, which penalizes all higher order derivatives in 
an efficient stochastic manner by adding a term that 
encourages J{x) and J(x + e) to be close: 

JCAE+H = 5>(* (t) ,S»(* (<) )) + \\\J(X®)\\ 2 F 

t (25) 

+ 7 lE £ [||J(a:) - J(x + e)\\ 2 F ], 

where e ~ A/"(0, a 2 1), and 7 is the associated regularization 
strength hyperparameter. 

The DAE and CAE have been successfully used to win 
the final phase of the Unsupervised and Transfer Learning 
Challenge [145]. The representation learned by the CAE 

16. That is, the robustness of the representation is encouraged. 

17. But note that in the CAE, the decoder weights are tied to the encoder 
weights to avoid degenerate solutions, and this should also make the 
decoder contractive. 
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tends to be saturated rather than sparse, i.e., most hidden 
units are near the extremes of their range (e.g., 0 or 1), and 
their derivative is near 0. The nonsaturated units are 
few and sensitive to the inputs, with their associated filters 
(weight vectors) together forming a basis explaining the 
local changes around x, as discussed in Section 8.2. 
Another way to get saturated (nearly binary) units is 
semantic hashing [175]. 

7.2.4 PSD 

Sparse coding [155] may be viewed as a kind of auto¬ 
encoder that uses a linear decoder with a squared 
reconstruction error, but whose nonparametric encoder fg 
performs the comparatively nontrivial and relatively costly 
iterative minimization of (2). A practically successful 
variant of sparse coding and autoencoders, named PSD 
[109], replaces that costly and highly nonlinear encoding 
step by a fast noniterative approximation during recogni¬ 
tion (computing the learned features). PSD has been applied 
to object recognition in images and video [110], [111], [106], 
but also to audio [84], mostly within the framework of 
multistage convolutional deep architectures (Section 11.2). 
The main idea can be summarized by the following 
equation for the training criterion, which is simultaneously 
optimized with respect to hidden codes (representation) hW 
and with respect to parameters (W,a): 

Jpsd = £ X0% + ||a:W - Wh^Wl + ||/i« - /„(*«) f 2 , 

t 

(26) 

where x® is the input vector for example t, /i® is the 
optimized hidden code for that example, and / Q (-) is the 
encoding function, the simplest variant being 

f a (xV) = t anh (6 + W T x w ), (27) 

where encoding weights are the transpose of decoding 
weights. Many variants have been proposed, including the 
use of a shrinkage operation instead of the hyperbolic 
tangent [111]. Note that the LI penalty on h tends to make 
them sparse, and how this is the same criterion as sparse 
coding with dictionary learning (3) except for the additional 
constraint that one should be able to approximate the 
sparse codes h with a parameterized encoder f a {x). One 
can thus view PSD as an approximation to sparse coding, 
where we obtain a fast approximate encoder. Once PSD is 
trained, object representations f a (x) are used to feed a 
classifier. They are computed quickly and can be further 
fine tuned: The encoder can be viewed as one stage or one 
layer of a trainable multistage system such as a feedforward 
neural network. 

PSD can also be seen as a kind of autoencoder, where the 
codes h are given some freedom that can help to further 
improve reconstruction. One can also view the encoding 
penalty added on top of sparse coding as a kind of 
regularizer that forces the sparse codes to be nearly 
computable by a smooth and efficient encoder. This is in 
contrast with the codes obtained by complete optimization 
of the sparse coding criterion, which are highly nonsmooth 
or even nondifferentiable, a problem that motivated other 
approaches to smooth the inferred codes of sparse coding 


[4], so a sparse coding stage could be jointly optimized 
along with following stages of a deep architecture. 

8 Representation Learning as Manifold 
Learning 

Another important perspective on representation learning is 
based on the geometric notion of manifold. Its premise is the 
manifold hypothesis, according to which real-world data 
presented in high-dimensional spaces are expected to 
concentrate in the vicinity of a manifold M of much lower 
dimensionality embedded in high-dimensional input 
space HT* 1 . This prior seems particularly well suited for AI 
tasks such as those involving images, sounds, or text, for 
which most uniformly sampled input configurations are 
unlike natural stimuli. As soon as there is a notion of 
“representation," one can think of a manifold by consider¬ 
ing the variations in input space which are captured by or 
reflected (by corresponding changes) in the learned repre¬ 
sentation. To first approximation, some directions are well 
preserved (the tangent directions of the manifold), while 
others are not (directions orthogonal to the manifolds). With 
this perspective, the primary unsupervised learning task is 
then seen as modeling the structure of the data-supporting 
manifold. 18 The associated representation being learned can 
be associated with an intrinsic coordinate system on the 
embedded manifold. The archetypal manifold modeling 
algorithm is, not surprisingly, also the archetypal low¬ 
dimensional representation learning algorithm: PCA, which 
models a linear manifold. It was initially devised with the 
objective of finding the closest linear manifold to a cloud of 
data points. The principal components, i.e., the representa¬ 
tion fg(x) that PCA yields for an input point x, uniquely 
locates its projection on that manifold: It corresponds to 
intrinsic coordinates on the manifold. Data manifolds for 
complex real-world domains are, however, expected to be 
strongly nonlinear. Their modeling is sometimes approached 
as patchworks of locally linear tangent spaces [211], [38]. 
The large majority of algorithms built on this geometric 
perspective adopt a nonparametric approach, based on a 
training set nearest neighbor graph [181], [172], [203], [38], 
[7], [62], [214], [91], [209]. In these nonparametric ap¬ 
proaches, each high-dimensional training point has its 
own set of free low-dimensional embedding coordinates, 
which are optimized so that certain properties of the 
neighborhood graph computed in original high-dimen¬ 
sional input space are best preserved. These methods, 
however, do not directly learn a parameterized feature 
extraction function fg(x) applicable to new test points, 19 
which seriously limits their use as feature extractors, except 
in a transductive setting. Comparatively, few nonlinear 
manifold learning methods have been proposed that learn a 
parametric map that can directly compute a representation 
for new points; we will focus on these. 

18. Actually, data points need not strictly lie on the "manifold," but the 
probability density is expected to fall off sharply as one moves away from it, 
and it may actually be constituted of several possibly disconnected 
manifolds with different intrinsic dimensionality. 

19. For several of these techniques, representations for new points 
can be computed using the Nystrom approximation as has been 
proposed as an extension in [20], but this remains cumbersome and 
computationally expensive. 
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8.1 Learning a Parametric Mapping Based on a 
Neighborhood Graph 

Some of the above nonparametric manifold learning 
algorithms can be modified to learn a parametric mapping 
fe, i.e., applicable to new points: Instead of having free low¬ 
dimensional embedding coordinate "parameters" for each 
training point, these coordinates are obtained through an 
explicitly parameterized function, as with the parametric 
variant [208] of t-SNE [209], 

Instead, semi-supervised embedding [217] learns a direct 
encoding while taking into account the manifold hypothesis 
through a neighborhood graph. A parameterized neural 
network architecture simultaneously learns a manifold 
embedding and a classifier. The training criterion encourages 
training set neigbhors to have similar representations. 

The reduced and tightly controlled number of free 
parameters in such parametric methods, compared to their 
pure nonparametric counterparts, forces models to general¬ 
ize the manifold shape nonlocally [22], which can translate 
into better features and final performance [209]. However, 
basing the modeling of manifolds on training set neighbor¬ 
hood relationships might be risky statistically in high¬ 
dimensional spaces (sparsely populated due to the curse of 
dimensionality) as, for example, most euclidean nearest 
neighbors risk having too little in common semantically. 
The nearest neighbor graph is simply not sufficiently 
densely populated to map out satisfyingly the wrinkles of 
the target manifold [17], [22], [16]. It can also become 
problematic computationally to consider all pairs of data 
points, 20 which scales quadratically with training set size. 

8.2 Learning to Represent Nonlinear Manifolds 

Can we learn a manifold without requiring nearest 
neighbor searches? Yes, for example, with regularized 
autoencoders or PCA. In PCA, the sensitivity of the 
extracted components (the code) to input changes is the 
same regardless of position x. The tangent space is the same 
everywhere along the linear manifold. By contrast, for a 
nonlinear manifold, the tangent of the manifold changes as 
we move on the manifold, as illustrated in Fig. 6. In 
nonlinear representation-learning algorithms, it is conveni¬ 
ent to think about the local variations in the representation as 
the input x is varied on the manifold, i.e., as we move among 
high-probability configurations. As we discuss below, the 
first derivative of the encoder therefore specifies the shape 
of the manifold (its tangent plane) around an example x 
lying on it. If the density was really concentrated on the 
manifold and the encoder had captured that, we would find 
the encoder derivatives to be nonzero only in the directions 
spanned by the tangent plane. 

Let us consider sparse coding in this light: Parameter 
matrix W may be interpreted as a dictionary of input 
directions from which a different subset will be picked to 
model the local tangent space at an x on the manifold. That 
subset corresponds to the active, i.e., nonzero, features for 
input x. Nonzero component h, will be sensitive to small 
changes of the input in the direction of the associated 
weight vector W : t , whereas inactive features are more likely 

20. Even if pairs are picked stochastically, many must be considered 
before obtaining one that weighs significantly on the optimization objective. 


to be stuck at 0 until a significant displacement has taken 
place in input space. 

The local coordinate coding (LCC) algorithm [223] is very 
similar to sparse coding, but is explicitly derived from a 
manifold perspective. Using the same notation as that of 
sparse coding in (2), LCC replaces regularization term 
||/i (f) ||i = Yjj 1^1 yielding objective 

Jlcc = £ (\\x (t) - Wh^wl + xJ2\hf ) \\\w, J - z (t) H 1+J, Y 
t \ j J 

(28) 

This is identical to sparse coding when p = — 1, but with 
larger p it encourages the active anchor points for x® (i.e., the 
codebook vectors W : j with nonnegligible \h^\ that are 
combined to reconstruct x®) to be not too far from x-9; 
hence the local aspect of the algorithm. An important 
theoretical contribution of Yu et al. [223] is to show that that 
any Lipschitz-smooth function f : M —> IR defined on a 
smooth nonlinear manifold M embedded in lR da: can be 
well approximated by a globally linear function with respect 
to the resulting coding scheme (i.e., linear in h), where the 
accuracy of the approximation and required number dh of 
anchor points depend on d m rather than d x . This result has 
been further extended with the use of local tangent 
directions [222], as well as to multiple layers [137]. 

Let us now consider the efficient noniterative "feedfor¬ 
ward" encoders fo, used by PSD and the autoencoders 
reviewed in Section 7.2, which are in the form of (20) or (27). 
The computed representation for x will only be significantly 
sensitive to input space directions associated with non- 
saturated hidden units (see, e.g., (24) for the Jacobian of a 
sigmoid layer). These directions to which the representation 
is significantly sensitive, like in the case of PCA or sparse 
coding, may be viewed as spanning the tangent space of the 
manifold at point x. 

Rifai et al. [167] empirically analyze in this light the 
singular value spectrum of the Jacobian (derivatives of 
representation vector with respect to input vector) of a 
trained CAE. Here, the SVD provides an ordered ortho¬ 
normal basis of most sensitive directions. The spectrum is 
sharply decreasing, indicating a relatively small number of 
significantly sensitive directions. This is taken as empirical 
evidence that the CAE indeed modeled the tangent space of 
a low-dimensional manifold. The leading singular vectors 
form a basis for the tangent plane of the estimated 
manifold, as illustrated in Fig. 4. The CAE criterion is 
believed to achieve this thanks to its two opposing terms: 
The isotropic contractive penalty that encourages the 
representation to be equally insensitive to changes in any 
input directions, and the reconstruction term that pushes 
different training points (in particular neighbors) to have a 
different representation (so they may be reconstructed 
accurately), thus counteracting the isotropic contractive 
pressure only in directions tangent to the manifold. 

Analyzing learned representations through the lens of 
the spectrum of the Jacobian and relating it to the notion of 
tangent space of a manifold is feasible whenever the 
mapping is differentiable and regardless of how it was 
learned, whether as direct encoding (as in autoencoder 
variants) or derived from latent variable inference (as in 
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Fig. 4. The tangent vectors to the high-density manifold as estimated by 
a contractive autoencoder [167], The original input is shown on the top 
left. Each tangent vector (images on right side of first row) corresponds 
to a plausible additive deformation of the original input, as illustrated on 
the second row, where a bit of the third singular vector is added to the 
original, to form a translated and deformed image. Unlike in PCA, the 
tangent vectors are different for different inputs because the estimated 
manifold is highly nonlinear. 

sparse coding or RBMs). Exact low-dimensional manifold 
models (like PCA) would yield nonzero singular values 
associated to directions along the manifold, and exact zeros 
for directions orthogonal to the manifold. But in smooth 
models like the CAE or the RBM, we will instead have large 
versus relatively small singular values (as opposed to 
nonzero versus exactly zero). 

8.3 Leveraging the Modeled Tangent Spaces 

The local tangent space, at a point along the manifold, can 
be thought of as capturing locally valid transformations 
that were prominent in the training data. For example, 
Rifai et al. [169] examine the tangent directions extracted 
with an SVD of the Jacobian of CAEs trained on digits, 
images, or text-document data: They appear to correspond 
to small translations or rotations for images or digits and 
to substitutions of words within the same theme for 
documents. Such very local transformations along a data 
manifold are not expected to change class identity. To 
build their MTC, Rifai et al. [169] then apply techniques 
such as tangent distance [189] and tangent propagation [188], 
which were initially developed to build classifiers that are 
insensitive to input deformations provided as prior 
domain knowledge. Now, these techniques are applied 
using the local leading tangent directions extracted by a 
CAE, i.e., not using any prior domain knowledge (except 
the broad prior about the existence of a manifold). This 
approach set a new record for MNIST digit classification 
among prior-knowledge free approaches. 21 

9 Connections between Probabilistic and 
Direct Encoding Models 

The standard likelihood framework for probabilistic models 
decomposes the training criterion for models with para¬ 
meters 9 in two parts: the log-likelihood logP(a: | 9) (or 
logP(a; | h, 9) with latent variables h) and the prior log.P(0) 
(or log P(h | 9) + log P(6) with latent variables). 

9.1 PSD: A Probabilistic Interpretation 

In the case of the PSD algorithm, a connection can be made 
between the above standard probabilistic view and the 
direct encoding computation graph. The probabilistic 
model of PSD is the same directed generative model 

21. It yielded 0.81 percent error rate using the full MNIST training set, 
with no prior deformations, and no convolution. 


P(x | h ) of sparse coding (Section 6.1.1), which only 
accounts for the decoder. The encoder is viewed as an 
approximate inference mechanism to guess P(h \ x) and 
initialize a MAP iterative inference (where the sparse prior 
P(h) is taken into account). However, in PSD, the encoder is 
trained jointly with the decoder, rather than simply taking the 
end result of iterative inference as a target to approximate. 
An interesting view 22 to reconcile these facts is that the 
encoder is a parametric approximation for the MAP solution of a 
variational lower bound on the joint log likelihood. When MAP 
learning is viewed as a special case of variational learning 
(where the approximation of the joint log likelihood is 
with a dirac distribution located at the MAP solution), 
the variational recipe tells us to simultaneously improve the 
likelihood (reduce reconstruction error) and improve the 
variational approximation (reduce the discrepancy between 
the encoder output and the latent variable value). Hence, 
PSD sits at the intersection of probabilistic models (with 
latent variables) and direct encoding methods (which 
directly parameterize the mapping from input to represen¬ 
tation). RBMs also sit at the intersection because their 
particular parameterization includes an explicit mapping 
from input to representation, thanks to the restricted 
connectivity between hidden units. However, this nice 
property does not extend to their natural deep general¬ 
izations, i.e., DBMs, discussed in Section 10.2. 

9.2 Regularized Autoencoders Capture Local 
Structure of the Density 

Can we also say something about the probabilistic inter¬ 
pretation of regularized autoencoders? Their training 
criterion does not fit the standard likelihood framework 
because this would involve a data-dependent "prior." An 
interesting hypothesis emerges to answer that question, out 
of recent theoretical results [210], [1]: The training criterion 
of regularized autoencoders, instead of being a form of 
maximum likelihood, corresponds to a different inductive 
principle, such as score matching. The score matching 
connection is discussed in Section 7.2.2 and has been shown 
for a particular parametrization of DAE and equivalent 
GRBM [210]. The work in [1] generalizes this idea to a 
broader class of parameterizations (arbitrary encoders and 
decoders) and shows that by regularizing the autoencoder 
so that it be contractive, one obtains that the reconstruction 
function and its derivative estimate first and second derivatives 
of the underlying data-generative density. This view can be 
exploited to successfully sample from autoencoders, as 
shown in [170], [26]. The proposed sampling algorithms 
are MCMCs similar to Langevin MCMC, using not just the 
estimated first derivative of the density, but also the 
estimated manifold tangents so as to stay close to manifolds 
of high density. 

This interpretation connects well with the geometric 
perspective introduced in Section 8. The regularization 
effects (e.g., due to a sparsity regularizer, a contractive 
regularizer, or the denoising criterion) ask the learned 
representation to be as insensitive as possible to the input, 
while minimizing reconstruction error on the training 
examples forces the representation to contain just enough 

22. Suggested by Ian Goodfellow, personal communication. 
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Fig. 5. Reconstruction function r(x) (green) learned by a high-capacity 
autoencoder on ID input, minimizing reconstruction error at training 
examples x (l> (r(x <t ' 1 ) in red) while trying to be as constant as possible 
otherwise. The dotted line is the identity reconstruction (which might be 
obtained without the regularizer). The blue arrows show the vector field 
of r(x) - x pointing toward high-density peaks estimated by the model 
and estimating the score (log-density derivative). 

information to distinguish them. The solution is that 
variations along the high-density manifolds are preserved, 
while other variations are compressed: The reconstruction 
function should be as constant as possible while reprodu¬ 
cing training examples, i.e., points near a training example 
should be mapped to that training example (Fig. 5). The 
reconstruction function should map an input toward the 
nearest point manifold, i.e., the difference between 
reconstruction and input is a vector aligned with the 
estimated score (the derivative of the log density with 
respect to the input). The score can be zero on the 
manifold (where reconstruction error is also zero) at local 
maxima of the log density, but it can also be zero at local 
minima. It means that we cannot equate low reconstruction 
error with high estimated probability. The second deriva¬ 
tives of the log density correspond to the first derivatives 
of the reconstruction function, and on the manifold (where 
the first derivative is 0) they indicate the tangent directions 
of the manifold (where the first derivative remains near 0). 

As illustrated in Fig. 6, the basic idea of the auto¬ 
encoder sampling algorithms in [170], [26] is to make 
MCMC moves where one 1) moves toward the manifold 
by following the density gradient (i.e., applying a 
reconstruction) and 2) adds noise in the directions of the 


Fig. 6. Sampling from regularized autoencoders [170], [26]: Each MCMC 
step adds to current state x the noise 6, mostly in the direction of the 
estimated manifold tangent plane H and projects back toward the 
manifold (high-density regions) by performing a reconstruction step. 


Fig. 7. Top: Early, during training, MCMC mixes easily between modes 
because the estimated distribution has high entropy and puts enough 
mass everywhere for small-steps movements (MCMC) to go from mode 
to mode. Bottom: Later on, training relying on good mixing can stall 
because estimated modes are separated by wide low-density deserts. 

leading singular vectors of the reconstruction (or encoder) 
Jacobian, corresponding to those associated with smallest 
second derivative of the log density. 

9.3 Learning Approximate Inference 

Let us now consider from closer how a representation is 
computed in probabilistic models with latent variables, 
when iterative inference is required. There is a computation 
graph (possibly with random number generation in some of 
the nodes, in the case of MCMC) that maps inputs to 
representation and, in the case of deterministic inference 
(e.g., MAP inference or variational inference), that function 
could be optimized directly. This is a way to generalize PSD 
that has been explored in recent work on probabilistic 
models at the intersection of inference and learning [4], [75], 
[79], [177], [195], [63], where a central idea is that instead of 
using a generic inference mechanism, one can use one that is 
learned and is more efficient, taking advantage of the 
specifics of the type of data on which it is applied. 

9.4 Sampling Challenges 

A troubling challenge with many probabilistic models with 
latent variables like most Boltzmann machine variants is 
that good MCMC sampling is required as part of the 
learning procedure, but that sampling becomes extremely 
inefficient (or unreliable) as training progresses because the 
modes of the learned distribution become sharper, making 
mixing between modes very slow. Whereas initially during 
training a learner assigns mass almost uniformly, as 
training progresse, its entropy decreases, approaching the 
entropy of the target distribution as more examples and 
more computation are provided. According to our manifold 
and natural clustering priors of Section 3.1, the target 
distribution has sharp modes (manifolds) separated by 
extremely low density areas. Mixing then becomes more 
difficult because MCMC methods, by their very nature, 
tend to make small steps to nearby high-probability 
configurations. This is illustrated in Fig. 7. 

Bengio et al. [27] suggest that deep representations 
could help mixing between such well-separated modes, 
based on both theoretical arguments and on empirical 
evidence. The idea is that if higher level representations 
disentangle the underlying abstract factors better, then 
small steps in this abstract space (e.g., swapping from 
one category to another) can easily be done by MCMC. 
The high-level representations can then be mapped back to 
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the input space to obtain input-level samples, as in the 
DBNs sampling algorithm [94]. This has been demon¬ 
strated both with DBNs and with the newly proposed 
algorithm for sampling from contracting and DAEs [170], 
[26]. This observation alone does not suffice to solve the 
problem of training a DBN or a DBM, but it may provide a 
crucial ingredient, and it makes it possible to consider 
successfully sampling from deep models trained by 
procedures that do not require an MCMC, like the stacked 
regularized autoencoders used in [170]. 

9.5 Evaluating and Monitoring Performance 

It is always possible to evaluate a feature learning 
algorithm in terms of its usefulness with respect to a 
particular task (e.g., object classification), with a predictor 
that is fed or initialized with the learned features. In 
practice, we do this by saving the features learned (e.g., at 
regular intervals during training, to perform early stop¬ 
ping) and training a cheap classifier on top (such as a linear 
classifier). However, training the final classifier can be a 
substantial computational overhead (e.g., supervised fine 
tuning a deep neural network usually takes more training 
iterations than the feature learning itself), so we may want 
to avoid having to train a classifier for every training 
iteration of the unsupervised learner and every hyperpara¬ 
meter setting. More importantly, this may give an incom¬ 
plete evaluation of the features (what would happen for 
other tasks?). All these issues motivate the use of methods 
to monitor and evaluate purely unsupervised performance. 
This is rather easy with all the autoencoder variants (with 
some caution outlined below) and rather difficult with the 
undirected graphical models such as the RBM and 
Boltzmann machines. 

For autoencoder and sparse coding variants, test set 
reconstruction error can be readily computed, but by itself 
may be misleading because larger capacity (e.g., more 
features, more training time) tends to systematically lead 
to lower reconstruction error, even on the test set. Hence, it 
cannot be used reliably for selecting most hyperparameters. 
On the other hand, demising reconstruction error is clearly 
immune to this problem, so that solves the problem for 
DAEs. Based on the connection between DAEs and CAEs 
uncovered in [26], [1], this immunity can be extended to 
DAEs, but not to the hyperparameter controlling the 
amount of noise or of contraction. 

For RBMs and some (not too deep) Boltzmann ma¬ 
chines, one option is the use of annealed importance 
sampling [150] to estimate the partition function (and thus 
the test log likelihood). Note that this estimator can have 
high variance and that it becomes less reliable (variance 
becomes too large) as the model becomes more interesting, 
with larger weights, more nonlinearity, sharper modes, and 
a sharper probability density function (see our previous 
discussion in Section 9.4). Another interesting and recently 
proposed option for RBMs is to track the partition function 
during training [59], which could be useful for early 
stopping and reducing the cost of ordinary AIS. For toy 
RBMs (e.g., 25 hidden units or less, or 25 inputs or less), 
the exact log likelihood can also be computed analytically, 
and this can be a good way to debug and verify some 
properties of interest. 


10 Global Training of Deep Models 

One of the most interesting challenges raised by deep 
architectures is: How should we jointly train all the levels ? In 
the previous section and in Section 4, we have only 
discussed how single-layer models could be combined to 
form a deep model. Here, we consider joint training of all 
the levels and the difficulties that may arise. 

10.1 The Challenge of Training Deep Architectures 

Higher level abstraction means more nonlinearity. It means 
that two nearby input configurations may be interpreted 
very differently because a few surface details change the 
underlying semantics, whereas most other changes in the 
surface details would not change the underlying semantics. 
The representations associated with input manifolds may 
be complex because the mapping from input to representa¬ 
tion may have to unfold and distort input manifolds that 
generally have complicated shapes into spaces where 
distributions are much simpler, where relations between 
factors are simpler, maybe even linear or involving many 
(conditional) independencies. Our expectation is that 
modeling the joint distribution between high-level abstrac¬ 
tions and concepts should be much easier in the sense of 
requiring much less data to learn. The hard part is learning 
a good representation that does this unfolding and 
disentangling. This may be at the price of a more difficult 
training problem, possibly involving ill conditioning and 
local minima. 

It is only since 2006 that researchers have seriously 
investigated ways to train deep architectures, to the excep¬ 
tion of the convolutional networks [133]. The first realization 
(Section 4) was that unsupervised or supervised layerwise 
training was easier and that this could be taken advantage of 
by stacking single-layer models into deeper ones. 

It is interesting to ask why does the layerwise unsupervised 
pretraining procedure sometimes help a supervised learner [65]. 
There seems to be a more general principle at play 23 of 
guiding the training of intermediate representations, which may 
be easier than trying to learn it all in one go. This is nicely 
related to the curriculum learning idea [24] that it may be 
much easier to learn simpler concepts first and then build 
higher level ones on top of simpler ones. This is also 
coherent with the success of several deep learning algo¬ 
rithms that provide some such guidance for intermediate 
representations, like semi-supervised embedding [217]. 

The question of why unsupervised pretraining could be 
helpful was extensively studied [65], trying to dissect the 
answer into a regularization effect and an optimization effect. 
The regularization effect is clear from the experiments, 
where the stacked RBMs or DAEs are used to initialize a 
supervised classification neural network [65]. It may simply 
come from the use of unsupervised learning to bias the 
learning dynamics and initialize it in the basin of attraction of 
a "good" local minimum (of the training criterion), where 
"good" is in terms of generalization error. The underlying 
hypothesis exploited by this procedure is that some of the 
features or latent factors that are good at capturing the 
leading variations in the input distribution are also good at 

23. First suggested to us by Leon Bottou. 
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capturing the variations in the target output random 
variables of interest (e.g., classes). The optimization effect 
is more difficult to tease out because the top two layers of a 
deep neural net can just overfit the training set whether the 
lower layers compute useful features or not, but there are 
several indications that optimizing the lower levels with 
respect to a supervised training criterion can be challenging. 

One such indication is that changing the numerical 
conditions of the optimization procedure can have a 
profound impact on the joint training of a deep architec¬ 
ture, for example, by changing the initialization range and 
changing the type of nonlinearity used [68], much more so 
than with shallow architectures. One hypothesis to explain 
some of the difficulty in the optimization of deep 
architectures is centered on the singular values of the 
Jacobian matrix associated with the transformation from the 
features at one level into the features at the next level [68]. If 
these singular values are all small (less than 1), then the 
mapping is contractive in every direction and gradients 
would vanish when propagated backward through many 
layers. This is a problem already discussed for recurrent 
neural networks [18], which can be seen as very deep 
networks with shared parameters at each layer, when 
unfolded in time. This optimization difficulty has moti¬ 
vated the exploration of second-order methods for deep 
architectures and recurrent networks, in particular Hessian- 
free second-order methods [142], [143]. Unsupervised 
pretraining has also been proposed to help training 
recurrent networks and temporal RBMs [198], i.e., at each 
time step, there is a local signal to guide the discovery of 
good features to capture in the state variables: model with 
the current state (as hidden units) the joint distribution of 
the previous state and the current input. Natural gradient 
[2] methods that can be applied to networks with millions 
of parameters (i.e., with good scaling properties) have also 
been proposed [127], [157]. Cho et al. [45] propose using 
adaptive learning rates for RBM training, along with a 
novel and interesting idea for a gradient estimator that 
takes into account the invariance of the model to flipping 
hidden unit bits and inverting signs of corresponding 
weight vectors. At least one study indicates that the choice 
of initialization (to make the Jacobian of each layer closer to 
1 across all its singular values) could substantially reduce 
the training difficulty of deep networks [68] and this is 
coherent with the success of the initialization procedure of 
echo state networks [104], as recently studied by Sutskever 
[196]. There are also several experimental results [68], [69], 
[151] showing that the choice of hidden units nonlinearity 
could influence both training and generalization perfor¬ 
mance, with particularly interesting results obtained with 
sparse rectifying units [106], [151], [69], [117], An old idea 
regarding the ill-conditioning issue with neural networks is 
that of symmetry breaking: Part of the slowness of conver¬ 
gence may be due to many units moving together (like 
sheep) and all trying to reduce the output error for the same 
examples. By initializing with sparse weights [142] or by 
using often saturated nonlinearities (such as rectifiers as 
max-pooling units), gradients only flow along a few paths, 
which may help hidden units to specialize more quickly. 
Another promising idea to improve the conditioning of 


neural network training is to nullify the average value and 
slope of each hidden unit output [158], and possibly locally 
normalize magnitude as well [106]. The debate still rages 
between using online methods such as stochastic gradient 
descent and using second-order methods on large mini¬ 
batches (of several thousand examples) [142], [123], with a 
variant of stochastic gradient descent recently winning an 
optimization challenge. 24 

Finally, several recent results exploiting large quantities of 
labeled data suggest that with proper initialization and 
choice of nonlinearity, very deep purely supervised net¬ 
works can be trained successfully without any layerwise 
pretraining [47], [69], [183], [117]. Researchers report that in 
such conditions, layerwise unsupervised pretraining 
brought little or no improvement over pure supervised 
learning from scratch when training for long enough. This 
reinforces the hypothesis that unsupervised pretraining acts 
as a prior, which may be less necessary when very large 
quantities of labeled data are available, but which begs 
the question of why this had not been discovered earlier. 
The latest results reported in this respect [117] are 
particularly interesting because they allowed drastically 
reducing the error rate of object recognition on a benchmark 
(the 1,000-class ImageNet task), where many more tradi¬ 
tional computer vision approaches had been evaluated 
(http://www. image-net.org/challenges/LSVRC/2012/ 
results.html). The main techniques that allowed this success 
include the following: efficient GPU training allowing one to 
train longer (more than 100 million visits of examples), an 
aspect first reported by Lee et al. [135] and Ciresan et al. 
[47], a large number of labeled examples, artificially transformed 
examples (see Section 11.1), a large number of tasks (1,000 or 
10,000 classes for ImageNet), convolutional architecture with 
max-pooling (see Section 11 for these latter two techniques), 
rectifying nonlinearities (discussed above), careful initialization 
(discussed above), careful parameter update and adaptive 
learning rate heuristics, layerwise feature normalization (across 
features), and a new dropout trick based on injecting strong 
binary multiplicative noise on hidden units. This trick is 
similar to the binary noise injection used at each layer of a 
stack of DAEs. Future work is hopefully going to help 
identify which of these elements matter most, how to 
generalize them across a large variety of tasks and 
architectures, and, in particular, contexts where most 
examples are unlabeled, i.e., including an unsupervised 
component in the training criterion. 

10.2 Joint Training of DBMs 

We now consider the problem of joint training of all layers 
of a specific unsupervised model, the DBM. Whereas much 
progress (albeit with many unanswered questions) has been 
made on jointly training all the layers of deep architectures 
using back-propagated gradients (i.e., mostly in the 
supervised setting), much less work has been done on their 
purely unsupervised counterpart, for example, with 
DBMs. 25 Note, however, that one could hope that the 

24. https://sites.google.com/site/nips2011workshop/optimization- 
challenges. 

25. Joint training of all the layers of a deep belief net is much more 
challenging because of the much harder inference problem involved. 
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successful techniques described in the previous section 
could be applied to unsupervised learning algorithms. 

Like the RBM, the DBM is another particular subset of 
the Boltzmann machine family of models where the units 
are again arranged in layers. However, unlike the RBM, the 
DBM possesses multiple layers of hidden units, with units 
in odd-numbered layers being conditionally independent 
given even-numbered layers, and vice versa. With respect 
to the Boltzmann energy function of (7), the DBM 
corresponds to setting U = 0 and a sparse connectivity 
structure in both V and 11'. We can make the structure of the 
DBM more explicit by specifying its energy function. For 
the model with two hidden layers, it is given as 

£% BM (v,h {1 \h {2) ;9) = -v T Wh {1) - /i (1)T Wi (2) 

_ d m T h m _ d m T h m _ # Vt (29) 

with 9 = {W, V, dP\ 6}. The DBM can also be charac¬ 
terized as a bipartite graph between two sets of vertices, 
formed by odd- and even-numbered layers (with v := h^). 


10.2.1 Mean-Field Approximate Inference 
A key point of departure from the RBM is that the posterior 
distribution over the hidden units (given the visibles) is no 
longer tractable due to the interactions between the hidden 
units. Salakhutdinov and Hinton [176] resort to a mean- 
field approximation to the posterior. Specifically, in the case 
of a model with two hidden layers, we wish to approximate 
P(h^\h^ | v) with the factored distribution Q v (h^\h^) = 
n^Qv(h? ] ) such that the KL divergence 
KL(P(h < ' 1 \h^ \ v)\\Q v {h 1 ,h 2 )) is minimized or, equiva¬ 
lently, that a lower bound to the log likelihood is maximized: 


log P{v) > C(Q V ) = £X> (h«, h (2) ) 

ftM h< 2 > 


log 


\Q v (hM,h( 2) ))' 


(30) 


Maximizing this lower bound with respect to the mean- 
field distribution Q v {h},h 2 ) (by setting derivatives to zero) 
yields the following mean field update equations: 


?(i) 


sigmoid 


£1 Wji y j + £] Vikh\? 

j k 



(31) 


/£ sigmoid ^ £] Vikh ^ . (32) 

Note how the above equations ostensibly look like a fixed 
point recurrent neural network, i.e., with constant input. In the 
same way that an RBM can be associated with a simple 
autoencoder, the above mean-field update equations for the 
DBM can be associated with a recurrent autoencoder. In that 
case, the training criterion involves the reconstruction error 
at the last or at consecutive time steps. This type of model 
has been explored by Savard [179] and Seung [186] and 
shown to do a better job at denoising than ordinary 
autoencoders. 

Iterating (31-32) until convergence yields the Q para¬ 
meters of the "variational positive phase" of 


C(Q V ) = IEq„ [log P{v, h^,h^)-\ogQ v (h^\h^)] 

= Eq„[— 4 ,BM (ri,/i (1) ,/i (2) ) - logQ„(ft (1 U (21 )] 
- log Z e , 


3C(Q V ) 

39 


= -1Eq„ 


+ IEj> 


90 

d£^ M (yM 1 \h { V) 1 

39 


(33) 


This variational learning procedure leaves the "negative 
phase" untouched, which can thus be estimated through 
SML or CD [89] as in the RBM case. 


10.2.2 Training DBMs 

The major difference between training a DBM and an RBM 
is that instead of maximizing the likelihood directly, we 
instead choose parameters to maximize the lower bound on 
the likelihood given in (30). The SML-based algorithm for 
maximizing this lower bound is as follows: 

1. Clamp the visible units to a training example. 

2. Iterate over (31)-(32) until convergence. 

3. Generate negative phase samples v~, hf9~, and ft/ 2 ) - 
through SML. 

4. Compute 3C(Q v )/39 using the values obtained in 
steps 2 and 3. 

5. Finally, update the model parameters with a step of 
approximate stochastic gradient ascent. 

While the above procedure appears to be a simple 
extension of the highly effective SML scheme for training 
RBMs, as we demonstrate in Desjardins et al. [60], this 
procedure seems vulnerable to falling in poor local 
minima that leave many hidden units effectively dead 
(not significantly different from its random initialization 
with small norm). 

The failure of the SML joint training strategy was noted 
by Salakhutdinov and Hinton [176]. As an alternative, they 
proposed a greedy layerwise training strategy. This 
procedure consists of pretraining the layers of the DBM, 
in much the same way as the DBN, i.e., by stacking RBMs 
and training each layer to independently model the output 
of the previous layer. A final joint "fine tuning" is done 
following the above SML-based procedure. 


11 Building-In Invariance 

It is well understood that incorporating prior domain 
knowledge helps machine learning. Exploring good strate¬ 
gies for doing so is a very important research avenue. 
However, if we are to advance our understanding of core 
machine learning principles, it is important that we keep 
comparisons between predictors fair and maintain a clear 
awareness of the prior domain knowledge used by different 
learning algorithms, especially when comparing their 
performance on benchmark problems. We have so far only 
presented algorithms that exploited only generic inductive 
biases for high-dimensional problems, thus making them 
potentially applicable to any high-dimensional problem. 
The most prevalent approach to incorporating prior knowl¬ 
edge is to hand design better features to feed a generic 
classifier and this has been used extensively in computer 
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vision (e.g., [138]). Here, we rather focus on how basic 
domain knowledge of the input, in particular its topological 
structure (e.g., bitmap images having a 2D structure), may 
be used to learn better features. 

11.1 Generating Transformed Examples 

Generalization performance is usually improved by 
providing a larger quantity of representative data. This 
can be achieved by generating new examples by applying 
small random deformations to the original training 
examples, using deformations that are known not to 
change the target variables of interest, for example, an 
object class is invariant to small transformations of images 
such as translations, rotations, scaling, or shearing. This 
old approach [5] has recently been applied with great 
success in the work of Ciresan et al. [47], who used an 
efficient GPU implementation (40 x speedup) to train a 
standard but large deep MLP on deformed MNIST digits. 
Using both affine and elastic deformations [187], with 
plain old stochastic gradient descent, they reach a record 
0.32 percent classification error rate. 

11.2 Convolution and Pooling 

Another powerful approach is based on even more basic 
knowledge of merely the topological structure of the input 
dimensions. By this we mean, for example, the 2D layout of 
pixels in images or audio spectrograms, the 3D structure of 
videos, the ID sequential structure of text, or of temporal 
sequences in general. Based on such structure, one can 
define local receptive fields [95] so that each low-level feature 
will be computed from only a subset of the input: a 
neighborhood in the topology (e.g., a subimage at a given 
position). This topological locality constraint corresponds to 
a layer having a very sparse weight matrix with nonzeros 
only allowed for topologically local connections. Comput¬ 
ing the associated matrix products can, of course, be made 
much more efficient than having to handle a dense matrix, 
in addition to the statistical gain from a much smaller 
number of free parameters. In domains with such topolo¬ 
gical structure, similar input patterns are likely to appear at 
different positions, and nearby values (e.g., consecutive 
frames or nearby pixels) are likely to have stronger 
dependencies that are also important to model the data. 
In fact, these dependencies can be exploited to discover the 
topology [126], i.e., recover a regular grid of pixels out of a 
set of vectors without any order information, for example, 
after the elements have been arbitrarily shuffled in the same 
way for all examples. Thus, the same local feature 
computation is likely to be relevant at all translated 
positions of the receptive field. Hence, the idea of sweeping 
such a local feature extractor over the topology: This 
corresponds to a convolution and transforms an input into 
a similarly shaped feature map. Equivalently to sweeping, 
this may be seen as static but differently positioned 
replicated feature extractors that all share the same 
parameters. This is at the heart of convolutional networks 
[131], [133] that have been applied both to object recognition 
and to image segmentation [207]. Another hallmark of the 
convolutional architecture is that values computed by the 
same feature detector applied at several neighboring input 
locations are then summarized through a pooling operation. 


typically taking their max or their sum. This confers upon 
the resulting pooled feature layer some degree of invariance 
to input translations, and this style of architecture (alter¬ 
nating selective feature extraction and invariance-creating 
pooling) has been the basis of convolutional networks, the 
Neocognitron [67] and HMAX [166] models, and argued to 
be the architecture used by mammalian brains for object 
recognition [166], [185], [61]. The output of a pooling unit 
will be the same irrespective of where a specific feature is 
located inside its pooling region. Empirically, the use of 
pooling seems to contribute significantly to improved 
classification accuracy in object classification tasks [133], 
[35], [36]. A successful variant of pooling connected to 
sparse coding is L2 pooling [103], [110], [122], for which the 
pool output is the square root of the possibly weighted sum 
of squares of filter outputs. Ideally, we would like to 
generalize feature pooling so as to learn what features should 
be pooled together, for example, as successfully done in 
several papers [100], [110], [122], [160], [53], [49], [76]. In this 
way, the pool output learns to be invariant to the variations 
captured by the span of the features pooled. 

11.2.1 Patch-Based Training 

The simplest approach for learning a convolutional layer in 
an unsupervised fashion is patch-based training: simply 
feeding a generic unsupervised feature learning algorithm 
with local patches extracted at random positions of the 
inputs. The resulting feature extractor can then be swiped 
over the input to produce the convolutional feature maps. 
That map may be used as a new input for the next layer, 
and the operation repeated to thus learn and stack several 
layers. Such an approach was recently used with indepen¬ 
dent subspace analysis [125] on 3D video blocks, reaching 
the state of the art on the Hollywood2, UCF, KTH, and 
YouTube action recognition datasets. Similarly, Coates and 
Ng [48] compared several feature learners with patch-based 
training and reached state-of-the-art results on several 
classification benchmarks. Interestingly, in this work, 
performance was almost as good with very simple k-means 
clustering as with more sophisticated feature learners. We, 
however, conjecture that this is the case only because 
patches are rather low-dimensional (compared to the 
dimension of a whole image). A large dataset might 
provide sufficient coverage of the space of, for example, 
edges prevalent in 6 x 6 patches so that a distributed 
representation is not absolutely necessary. Another plau¬ 
sible explanation for this success is that the clusters 
identified in each image patch are then pooled into a 
histogram of cluster counts associated with a larger 
subimage. Whereas the output of a regular clustering is a 
one-hot nondistributed code, this histogram is itself a 
distributed representation, and the "soft" k-means [48] 
representation allows not only the nearest filter, but also its 
neighbors to be active. 

11.2.2 Convolutional and Tiled-Convolutional Training 

It is possible to directly train large convolutional layers 
using an unsupervised criterion. An early approach [105] 
trained a standard but deep convolutional MLP on the task 
of denoising images, i.e., as a deep, convolutional, DAE. 
Convolutional versions of the RBM or its extensions have 
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also been developed [57], [135], [202] as well as a 
probabilistic max-pooling operation built into convolutional 
deep networks [135], [136], [115]. Other unsupervised 
feature learning approaches that were adapted to the 
convolutional setting include PSD [110], [111], [106], [84], 
a convolutional version of sparse coding called deconvolu- 
tional networks [226], topographic ICA [122], and mPoT, 
which Kivinen and Williams [113] applied to modeling 
natural textures. Gregor and LeCun [74] and Le et al. [122] 
also demonstrated the technique of tiled convolution, where 
parameters are shared only between feature extractors 
whose receptive fields are k steps away (so the ones looking 
at immediate neighbor locations are not shared). This 
allows pooling units to be invariant to more than just 
translations and is a hybrid between convolutional net¬ 
works and earlier neural networks with local connections 
but no weight sharing [128], [130]. 

11.2.3 Alternatives to Pooling 

Alternatively, one can also use explicit knowledge of the 
expected invariants expressed mathematically to define 
transformations that are robust to a known family of input 
deformations, using the so-called scattering operators [139], 
[40], which can be computed in a way interestingly 
analogous to deep convolutional networks and wavelets. 
Like convolutional networks, the scattering operators 
alternate two types of operations: convolution and pooling 
(as a norm). Unlike convolutional networks, the proposed 
approach keeps at each level all of the information about the 
input (in a way that can be inverted) and automatically 
yields a very sparse (but very high dimensional) represen¬ 
tation. Another difference is that the filters are not learned 
but instead are set so as to guarantee that a priori specified 
invariances are robustly achieved. Just a few levels were 
sufficient to achieve impressive results on several bench¬ 
mark datasets. 

11.3 Temporal Coherence and Slow Features 

The principle of identifying slowly moving/changing 
factors in temporal/spatial data has been investigated by 
many [6], [219], [96], [114], [41] as a principle for finding 
useful representations. In particular, this idea has been 
applied to image sequences and as an explanation for 
why VI simple and complex cells behave the way they do. 
A good overview can be found in [96], [31]. 

More recently, temporal coherence has been successfully 
exploited in deep architectures to model video [147]. It was 
also found that temporal coherence discovered visual 
features similar to those obtained by ordinary unsuper¬ 
vised feature learning [28], and a temporal coherence 
penalty has been combined with a training criterion for 
unsupervised feature learning [227], sparse autoencoders 
with LI regularization, in this case, yielding improved 
classification performance. 

The temporal coherence prior can be expressed in 
several ways, the simplest being the squared difference 
between feature values at times t and t + 1. Other plausible 
temporal coherence priors include the following: First, 
instead of penalizing the squared change, penalizing the 
absolute value (or a similar sparsity penalty) would state 
that most of the time the change should be exactly 0, which 
would intuitively make sense for the real-life factors that 


surround us. Second, one would expect that instead of just 
being slowly changing, different factors could be associated 
with their own different time scale. The specificity of their 
time scale could thus become a hint to disentangle 
explanatory factors. Third, one would expect that some 
factors should really be represented by a group of numbers 
(such as the x, y, and 2 position of some object in space and 
the pose parameters of Hinton et al. [85]) rather than by a 
single scalar and that these groups tend to move together. 
Structured sparsity penalties [110], [107], [3], [76] could be 
used for this purpose. 

11.4 Algorithms to Disentangle Factors of Variation 

The goal of building invariant features is to remove 
sensitivity of the representation to directions of variance 
in the data that are uninformative to the task at hand. 
However, it is often the case that the goal of feature 
extraction is the disentangling or separation of many distinct 
but informative factors in the data, for example, in a video 
of people: subject identity, action performed, subject pose 
relative to the camera, and so on. In this situation, the 
methods of generating invariant features, such as feature 
pooling, may be inadequate. 

The process of building invariant features can be seen as 
consisting of two steps. First, low-level features are 
recovered that account for the data. Second, subsets of 
these low-level features are pooled together to form higher 
level invariant features, exemplified by the pooling and 
subsampling layers of convolutional neural networks. The 
invariant representation formed by the pooling features 
offers an incomplete window on the data as the detailed 
representation of the lower level features is abstracted away 
in the pooling procedure. While we would like higher level 
features to be more abstract and exhibit greater invariance, 
we have little control over what information is lost through 
pooling. What we really would like is for a particular 
feature set to be invariant to the irrelevant features and 
disentangle the relevant features. Unfortunately, it is often 
difficult to determine a priori which set of features will 
ultimately be relevant to the task at hand. 

An interesting approach to taking advantage of some of 
the factors of variation known to exist in the data is the 
transforming autoencoder [85]: Instead of a scalar pattern 
detector (e.g., corresponding to the probability of the 
presence of a particular form in the input), one can think 
of the features as organized in groups that include both a 
pattern detector and pose parameters that specify attributes of 
the detected pattern. In [85], what is assumed a priori is that 
pairs of examples (or consecutive ones) are observed with 
an associated value for the corresponding change in the pose 
parameters. For example, an animal that controls its eyes 
knows what changes to its ocular motor system were applied 
when going from one image on its retina to the next. In that 
work, it is also assumed that the pose changes are the same 
for all the pattern detectors, and this makes sense for global 
changes such as image translation and camera geometry 
changes. Instead, we would like to discover the pose 
parameters and attributes that should be associated with 
each feature detector without having to specify ahead of 
time what they should be, force them to be the same for all 
features, and having to necessarily observe the changes in 
all of the pose parameters or attributes. 
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The approach taken recently in the MTC, discussed in 
Section 8.3, is interesting in this respect. Without any 
supervision or prior knowledge, it finds prominent local 
factors of variation (tangent vectors to the manifold, 
extracted from a CAE, interpreted as locally valid input 
"deformations"). Higher level features are subsequently 
encouraged to be invariant to these factors of variation so 
that they must depend on other characteristics. In a sense, 
this approach is disentangling valid local deformations 
along the data manifold from other, more drastic changes 
associated to other factors of variation such as those that 
affect class identity. 26 

One solution to the problem of information loss that 
would fit within the feature-pooling paradigm is to 
consider many overlapping pools of features based on the 
same low-level feature set. Such a structure would have the 
potential to learn a redundant set of invariant features that 
may not cause significant loss of information. However, it 
is not obvious what learning principle could be applied 
that can ensure that the features are invariant while 
maintaining as much information as possible. While a 
DBN or a DBM (as discussed in Sections 4 and 10.2, 
respectively) with two hidden layers would, in principle, be 
able to preserve information into the "pooling" second 
hidden layer, there is no guarantee that the second layer 
features are more invariant than the "low-level" first layer 
features. However, there is some empirical evidence that 
the second layer of the DBN tends to display more 
invariance than the first layer [64], 

A more principled approach, from the perspective of 
ensuring a more robust compact feature representation, can 
be conceived by reconsidering the disentangling of features 
through the lens of its generative equivalent—feature 
composition. Since many unsupervised learning algorithms 
have a generative interpretation (or a way to reconstruct 
inputs from their high-level representation), the generative 
perspective can provide insight into how to think about 
disentangling factors. The majority of the models currently 
used to construct invariant features have the interpretation 
that their low-level features linearly combine to construct 
the data. 27 This is a fairly rudimentary form of feature 
composition with significant limitations. For example, it is 
not possible to linearly combine a feature with a generic 
transformation (such as translation) to generate a trans¬ 
formed version of the feature. Nor can we even consider a 
generic color feature being linearly combined with a 
grayscale stimulus pattern to generate a colored pattern. It 
would seem that if we are to take the notion of disen¬ 
tangling seriously, we require a richer interaction of 
features than that offered by simple linear combinations. 

12 Conclusion 

This review of representation learning and deep learning 
has covered three major and apparently disconnected 

26. The changes that affect class identity might, in input space, actually 
be of similar magnitude to local deformations, but not follow along the 
manifold, i.e., cross zones of low density. 

27. As an aside, if we are given only the values of the higher level 
pooling features, we cannot accurately recover the data because we do not 
know how to apportion credit for the pooling feature values to the lower 
level features. This is simply the generative version of the consequences of 
the loss of information caused by pooling. 


approaches: the probabilistic models (both the directed 
kind such as sparse coding and the undirected kind such as 
Boltzmann machines), the reconstruction-based algorithms 
related to autoencoders, and the geometrically motivated 
manifold-learning approaches. Drawing connections be¬ 
tween these approaches is currently a very active area of 
research and is likely to continue to produce models and 
methods that take advantage of the relative strengths of 
each paradigm. 

Practical concerns and guidelines. One of the criticisms 
addressed to artificial neural networks and deep learning 
algorithms is that they have many hyperparameters and 
variants and that exploring their configurations and 
architectures is an art. This has motivated an earlier book 
on the "Tricks of the Trade" [156] of which [132] is still 
relevant for training deep architectures, in particular what 
concerns initialization, ill conditioning, and stochastic 
gradient descent. A good and more modern compendium 
of good training practice, particularly adapted to training 
RBMs, is provided in [90], while a similar guide oriented 
more toward deep neural networks can be found in [13], 
both of which are part of a novel version of the above 
book. Recent work on automating hyperparameter search 
[29], [30], [191] is also making it more convenient, efficient, 
and reproducible. 

Incorporating generic Al-level priors. We have covered 
many high-level generic priors that we believe could bring 
machine learning closer to AI by improving representation 
learning. Many of these priors relate to the assumed 
existence of multiple underlying factors of variation, whose 
variations are in some sense orthogonal to each other. They 
are expected to be organized at multiple levels of abstrac¬ 
tion; hence the need for deep architectures, which also have 
statistical advantages because they allow to reuse para¬ 
meters in a combinatorially efficient way. Only a few of 
these factors would typically be relevant for any particular 
example, justifying sparsity of representation. These factors 
are expected to be related to simple (e.g., linear) depen¬ 
dencies, with subsets of these explaining different random 
variables of interest (inputs, tasks) and varying in struc¬ 
tured ways in time and space (temporal and spatial 
coherence). We expect future successful applications of 
representation learning to refine and increase that list of 
priors, and to incorporate most of them instead of focusing 
on only one. Research in training criteria that better take 
these priors into account is likely to move us closer to the 
long-term objective of discovering learning algorithms that 
can disentangle the underlying explanatory factors. 

Inference. We anticipate that methods based on directly 
parameterizing a representation function will incorporate 
more and more of the iterative type of computation one 
finds in the inference procedures of probabilistic latent- 
variable models. There is already movement in the other 
direction, with probabilistic latent-variable models exploit¬ 
ing approximate inference mechanisms that are them¬ 
selves learned (i.e., producing a parametric description of 
the representation function). A major appeal of probabil¬ 
istic models is that the semantics of the latent variables 
are clear, and this allows a clean separation of the 
problems of modeling (choose the energy function), 
inference (estimating P(h \ x)), and learning (optimizing 
the parameters), using generic tools in each case. On the 
other hand, doing approximate inference and not taking 
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that approximation into account explicitly in the approx¬ 
imate optimization for learning could have detrimental 
effects; hence the appeal of learning approximate infer¬ 
ence. More fundamentally, there is the question of the 
multimodality of the posterior P(h \ x). If there are 
exponentially many probable configurations of values of 
the factors h, that can explain x, then we seem to be stuck 
with very poor inference, either focusing on a single mode 
(MAP inference), assuming some kind of strong factoriza¬ 
tion (as in variational inference), or using an MCMC that 
cannot visit enough modes of P(h \ x). What we propose 
as food for thought is the idea of dropping the 
requirement of an explicit representation of the posterior 
and settling for an implicit representation that exploits 
potential structure in P(h\ x) to represent it compactly: 
Even though P(h\x) may have an exponential number of 
modes, it may be possible to represent it with a small set 
of numbers. For example, consider computing a determi¬ 
nistic feature representation f(x) that implicitly captures 
the information about a highly multimodal P(h \ x), in the 
sense that all the questions (e.g., making some prediction 
about some target concept) that can be asked from 
P(h | x) can also be answered from /( x). 

Optimization. Much remains to be done to better under¬ 
stand the successes and failures of training deep architec¬ 
tures, both in the supervised case (with many recent 
successes) and the unsupervised case (where much more 
work needs to be done). Although regularization effects can 
be important on small datasets, the effects that persist on 
very large datasets suggest some optimization issues are 
involved. Are they more due to local minima (we now 
know there are huge numbers of them) and the dynamics of 
the training procedure? Or are they due mostly to ill 
conditioning and may be handled by approximate second- 
order methods? These basic questions remain unanswered 
and deserve much more study. 
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